Abstract. The automatic ranking of word pairs as per their semantic relatedness and ability to
mimic human notions of semantic relatedness has widespread applications. Measures that rely
on raw data (distributional measures) and those that use knowledge-rich ontologies both exist.
Although extensive studies have been performed to compare ontological measures with human
judgment, the distributional measures have primarily been evaluated by indirect means. This
paper is a detailed study of some of the major distributional measures; it lists their respective
merits and limitations. New measures that overcome these drawbacks, that are more in line
with the human notions of semantic relatedness, are suggested. The paper concludes with
an exhaustive comparison of the distributional and ontology-based measures. Along the way,
significant research problems are identified. Work on these problems may lead to a better
understanding of how semantic relatedness is to be measured.

1. Introduction

Humans are inherently capable of determining whether one word pair is more 
semantically related than another. For example, given the word pairs honey- 
bee and paper-car, one can easily identify the former pair to be more se- 
mantically related than the latter. This, however, is not true for machines. A 
lot of work has been done in automating the process in the last fifteen years. 
While some approaches do better than others and have been applied to solving 
practical problems, none has matched human judgment. 

Typically, automated systems assign a score of semantic relatedness to a 
given pair of words (target words) calculated from a relatedness measure. 
The absolute score is usually irrelevant on its own. For example, a relatedness 
score of 0.7 between a and b, in a possible range of 0 to 1, does not imply
that a and b are more related than the average word pair. However, given 
that the semantic relatedness of c and d is 0.6, the system can conclude that 
a and b are more related than c and d. Thus even though the absolute score 
given by a relatedness measure is not of much significance, it is important that 
the measure give a higher score to word pairs which humans think are more
related and comparatively lower scores to word pairs that are less related. This 
ability to mimic human judgment of semantic relatedness has been used in 
numerous applications such as automated spelling correction, word sense dis- 
ambiguation, thesaurus creation, information retrieval, text summarization, 
and identifying discourse structure. 

Existing measures of semantic relatedness rely either on ontologies and 
semantic networks or just raw text. Budanitsky (1999), Budanitsky and Hirst 
(2001) and Patwardhan et al. (2003) do an extensive survey and comparison 
of the various WordNet-based measures. Measures that use just raw text, 
known as the distributional measures, have been described individually 
(for example, in Schütze and Pedersen (1997), Hindle (1990), Lin (1998a),
Pereira et al. (1993), etc.) but have not been extensively compared with
each other. This paper focuses on distributional measures and analyzes their
strengths and limitations. Particular attention is paid to the different kinds of
distributional measures and their components. New measures are suggested 
that overcome some of their drawbacks. Characteristics of WordNet-based 
and distributional measures are contrasted and finally, future research direc- 
tions are suggested which may determine a better understanding of semantic 
relatedness. 



2. Background 



2.1. Co-occurrences 



Words that occur within a certain window of a target word are called the
co-occurrences of the word. The window size may be a few words on either
side, the complete sentence, a paragraph or the entire document. Consider the 
sentence below: 



the plane flew through a cloud 



If we consider the window size to be the complete sentence, flew co-occurs 
with the, plane, through, a and cloud. The set of words that co-occur with 
a word constitute the context of the word. They are used in tasks such as 
information retrieval, word sense disambiguation, and semantic relatedness. 
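
The definition above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: text is already tokenized, the function name `cooccurrences` and its `window` parameter are hypothetical, and a window of `None` stands for the complete sentence.

```python
from collections import defaultdict

def cooccurrences(sentences, window=None):
    """Map each word to the set of words that co-occur with it.

    `sentences` is a list of token lists. If `window` is None, the whole
    sentence is the window; otherwise only tokens at most `window`
    positions to the left or right of the target are counted.
    """
    context = defaultdict(set)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = 0 if window is None else max(0, i - window)
            hi = len(tokens) if window is None else min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a token is not its own co-occurrence
                    context[target].add(tokens[j])
    return context

sentence = "the plane flew through a cloud".split()
full = cooccurrences([sentence])             # window = complete sentence
narrow = cooccurrences([sentence], window=1)  # one word on either side
```

With the sentence window, `full["flew"]` contains the five other words of the sentence; with a window of one word on either side, `narrow["flew"]` contains only plane and through.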
2.2. Word Association Ratio

Given two events x and y with probabilities P(x) and P(y), their pointwise
mutual information (Fano, 1961), PMI for short, or just I, is defined as
follows:

\[ I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} \tag{1} \]

P(x,y) is the joint probability of x and y. If I(x,y) evaluates to be close to zero,
i.e., P(x,y) ≈ P(x) × P(y), then events x and y occur together
just as often as is expected from their individual probabilities. If I(x,y) ≫ 0,
then x and y occur together more often than would be expected from
their individual probabilities and hence have a strong correlation.

Church and Hanks (1989) introduce word association ratio, which is
similar to pointwise mutual information. If x and y are words with probabilities
P(x) and P(y) (estimated by corpus counts), their association ratio is
defined to be the same as in (1), except that P(x,y) stands for the probability
that x appears, within a certain window, before y. It should be noted that
P(x,y) is no longer symmetric (P(x,y) ≠ P(y,x)) as P(x,y) and P(y,x) represent
two different events. If two words have a word association ratio close to
zero then they do not share an interesting relationship, but if I(w1,w2) ≫ 0,
then w2 follows w1 (within a certain window) more often than chance and
the words w1 and w2 are strong co-occurrences. Theoretically, word association
ratio may yield negative values (word pair occurs less frequently than
expected by random chance) but Church and Hanks (1989) show that it is
hard to accurately predict negative word association ratios with confidence.
Systems which use word association ratio may be adversely affected by this.
A common approach to counter this is to equate the negative association
values to 0 (for example, Lin (1998a)). This usually means that the system
will ignore such words.

A problem with PMI in general (which is inherited by word association
ratio) is that low frequency events get higher scores than expected. Pantel
and Lin (2002) try to overcome this by multiplying the PMI value with a
correction factor. Although Pantel and Lin give the correction factor for
word association ratio using syntactically related co-occurring words, a more
generic form applicable for pointwise mutual information is as shown below:

\[ I_{corrected}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} \times \frac{\min(freq(x), freq(y))}{\min(freq(x), freq(y)) + 1} \tag{2} \]

The correction factor is large (close to 1) if both the events occur a large 
number of times and small (close to 0) if any of the two events occurs very 
few times. 
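
As a rough illustration of equations (1) and (2), the sketch below estimates PMI from raw counts and applies the frequency-based correction factor. The function names and the toy counts are assumptions made for this example, and probabilities are estimated as simple relative frequencies.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information (base 2), equation (1), from raw counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

def corrected_pmi(count_xy, count_x, count_y, total):
    """PMI multiplied by min(freq) / (min(freq) + 1), equation (2)."""
    m = min(count_x, count_y)
    return pmi(count_xy, count_x, count_y, total) * m / (m + 1)

# Hypothetical counts in a collection of 1000 events.
rare = (1, 1, 1, 1000)            # both events seen once, together
frequent = (100, 200, 250, 1000)  # both events seen often

rare_raw, rare_fixed = pmi(*rare), corrected_pmi(*rare)
freq_raw, freq_fixed = pmi(*frequent), corrected_pmi(*frequent)
```

In this toy setting the rare pair has a far higher raw PMI than the frequent pair, and the correction factor halves it (1/2), while the frequent pair's score is almost unchanged (200/201).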
2.3. Relatedness vs Similarity

A closely related concept to semantic relatedness is semantic similarity. 
While there is some overlap in their meanings and they may be used inter- 
changeably in certain contexts, it is important to be aware of their distinction. 
Budanitsky and Hirst (2001) and Budanitsky and Hirst (2004) point out that 
semantic similarity is used when similar entities such as apples and bananas 
or table and furniture are compared. These entities are close to each other
in an is-a hierarchy. For example, apples and bananas are hyponyms of fruit 
and table is a hyponym of furniture. However, even dissimilar entities may be 
semantically related, for example, door and knob, tree and shade, or gym and 
weights. In this case the two entities are not similar per se, but are related by 
some relationship. This relationship may be one of the classical relationships 
such as meronymy (is part of) as in door-knob or a non-classical one as in 
tree-shade and gym-weights. Thus two entities are semantically related if
they are semantically similar (close together in the is-a hierarchy) or share
any other classical or non-classical relationships. As Budanitsky and Hirst 
(2004) point out, semantic similarity is a subset of semantic relatedness. 

The concept of semantic distance has traditionally been used in the con- 
text of both semantic relatedness and semantic similarity. In the former con- 
text, it represents the inverse of semantic relatedness, while in the latter, it 
is the inverse of semantic similarity. In this paper as well, we shall continue 
to use the term for both concepts with the confidence that the context will 
disambiguate the intended meaning. 

2.4. The Distributional Hypothesis 

Given a text corpus, individual words have more or less differing contexts 
around them. The context of a word is composed of words co-occurring with
it within a certain window around it. Distributional measures use statistics
acquired from large text corpora to determine how similar the contexts
of two words are. These measures are also used as proxies to measures of 
semantic similarity as words found in similar contexts tend to be semanti- 
cally similar. This is known as the distributional hypothesis (Firth (1957) 
and Harris (1968)) and such measures have traditionally been referred to as 
measures of distributional similarity. 

The hypothesis makes intuitive sense as Budanitsky and Hirst (2004) point 
out. If two words have many co-occurring words then similar things are being 
said about both of them and so they are likely to be semantically similar. Con- 
versely, if two words are semantically similar then they are likely to be used in 
a similar fashion in text and thus end up with many common co-occurrences. 
For example, the semantically similar bug and insect are expected to have a 
number of common co-occurring words such as crawl, squash, small, woods, 
and so on, in a large enough text corpus. 
Like measures of distributional similarity, there exist measures of what
we will call distributional relatedness (Schütze and Pedersen (1997) and
Yoshida et al. (2003)). These measures use raw text and co-occurrence in- 
formation to determine semantic relatedness between two words. The distri- 
butional hypothesis mentioned earlier is generic enough to be the basis for 
both distributional similarity and distributional relatedness. We propose more 
specific hypotheses that demarcate the two. 

Hypothesis of distributional similarity: 

Distributionally similar words tend to be semantically similar, where
two words (w1 and w2, say) are said to be distributionally similar if they
have many common co-occurring words and these co-occurring words are
each related to w1 and w2 by the same syntactic relation.

Hypothesis of distributional relatedness: 

Distributionally related words tend to be semantically related, where
two words (w1 and w2, say) are said to be distributionally related if they have
many common co-occurring words and this set of co-occurring words is
not restricted to only those that are related to w1 and w2 by the same
syntactic relation.

The two hypotheses are based on the fact that semantically similar words
belong to the same broad part of speech (noun, verb, etc) and are thus each 
syntactically related to most common co-occurring words by the same syn- 
tactic relation. Further, the more two words are semantically related, the more 
common co-occurring words they have. Consider the semantically related 
word pair doctor-operate. In a large enough body of text, the two words are 
likely to have the following common co-occurring words: patient, scalpel,
surgery, recuperate, and so on. All these words will be used by a measure of 
distributional relatedness and the pair will be assigned a high score. However, 
a measure of distributional similarity will not use any of these co-occurring 
words (and likely no other, for that matter) as they are not related to the target 
words by the same syntactic relation. The word doctor is almost always used 
as a noun while operate is a verb. Thus doctor and operate will get a very 
low score of distributional similarity. The word pair doctor-nurse, on the 
other hand, will get a high score of distributional relatedness and distributional
similarity. Thus an important characteristic of any distributional measure is
whether it is a measure of distributional similarity or more generally that of 
distributional relatedness. 

It should be noted that a measure of distributional similarity will provide 
a high score for certain closely related but dissimilar words belonging to the 
same thematic role. For example, homeless and drunk, which refer to dissimilar
concepts but share a non-classical relationship of association (homeless
and drunk tend to occur together in text), will likely get a high score as they
belong to the same part of speech (adjective) and may have many common
co-occurring words such as beggar, person, helped, and so on, related by
the same syntactic relation. This is a limitation of the current measures of
distributional similarity and the impact of the limitation on the ability of the 
measures to mimic semantic similarity is worth determining. 

The relevant literature uses the term distance as the inverse of distributional
similarity. In order to distinguish it clearly from semantic distance, this paper
will refer to the inverse of distributional similarity as distributional distance.
Like semantic distance, distributional distance will also be used as
the inverse of distributional relatedness, and the context should help
disambiguate the intended meaning.

2.5. Relatedness of Words and Concepts 

Measures of semantic relatedness and similarity are applied to particular con- 
cepts (or particular senses of the words); for example, one may determine the 
semantic relatedness of bank in the financial institution sense and interest 
in the interest rate sense. Distributional measures, on the other hand, usu- 
ally assign scores to word pairs irrespective of the nature of their polysemy 
(how many senses they have) or the particular senses they have been used 
in. Distributional measures will need a much more knowledge rich source 
(for example large amounts of sense-tagged corpora) than raw text to assign 
scores to word-sense pairs. 

2.6. Evaluation 

The presence of a large number of relatedness measures necessitates a suit- 
able evaluation to determine which methods come closest to the human no- 
tions of relatedness and to determine how good they each are. There exist two 
modes of evaluation. The first involves the creation of two ranked lists of cer- 
tain word pairs. One list is created using a relatedness measure while the other 
is ranked by humans. The correlation of the two rankings is indicative of how 
closely the measure mimics human judgment of relatedness. Rubenstein and 
Goodenough (1965) were the first to conduct quantitative experiments with 
human subjects who were asked to rate 65 word pairs on a scale of 0.0 to 4.0 
as per their relatedness. The word pairs chosen ranged from very similar and 
almost synonymous to unrelated. Miller and Charles (1991) also conducted a 
similar study on 30 word pairs taken from the Rubenstein-Goodenough pairs. 
However, lack of large amounts of data from human subject experimentation 
limits the quality of this mode of evaluation. 
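
The first mode of evaluation can be sketched as a rank correlation between the scores of a measure and human ratings. The text does not name a particular correlation statistic, so the use of Spearman's rank correlation here is an assumption, and the scores below are made-up values for illustration, not data from any of the cited studies.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical scores for four word pairs (illustrative values only).
human = [3.9, 3.1, 1.2, 0.4]        # ratings on a 0.0 to 4.0 scale
measure = [0.81, 0.40, 0.55, 0.05]  # scores from some relatedness measure
rho = spearman(human, measure)
```

Only the order of the scores matters here, which matches the earlier observation that the absolute score of a relatedness measure is of little significance on its own.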

The second and a more indirect way of evaluating measures of semantic 
relatedness is by the performance of natural language tasks that use them, 
for example, automatic spelling correction, word sense disambiguation, es- 
timation of unseen bigram (not found in training data) probabilities, and so 
on. 



Distributional Measures as Proxies for Semantic Relatedness 7 

3. Distributional Measures 

3.1. Spatial Metrics 

A popular technique to determine distributional relatedness between two words 
is to map them to points in a multidimensional space such that the distance 
between the two points is an indicator of distributional and thereby semantic 
distance between them. 

Large co-occurrence matrices pertaining to each word, which store the
set of words that co-occur with it within a certain window size, are created
from a text corpus. Consider a multidimensional space where the number of
dimensions is equal to the size of the vocabulary. A word w1 can be represented by
a point in this space such that the vector $\vec{w}_1$ from the origin to this point
has equal positive components in all dimensions corresponding to words that
co-occur with w1. Similarly, vector $\vec{w}_2$ can be created for word w2. This section
describes three distributional distance metrics that quantify the distance
between $\vec{w}_1$ and $\vec{w}_2$.

3.1.1. Cosine 

The cosine method (denoted by Cos) is one of the earliest distributional measures.
Given two words w1 and w2, the cosine measure calculates the cosine
of the angle between $\vec{w}_1$ and $\vec{w}_2$. If a large number of words co-occur with
both w1 and w2, $\vec{w}_1$ and $\vec{w}_2$ will have a small angle between them, the cosine
will be large, and we get a large relatedness value between them. The cosine
measure gives scores in the range from 0 (unrelated) to 1 (maximally related).



\[ Cos(w_1, w_2) = \frac{\vec{w}_1 \cdot \vec{w}_2}{|\vec{w}_1| \times |\vec{w}_2|} \tag{3} \]

A limitation of the cosine method in its original form is that all co-occurring
words are treated the same, irrespective of how often they co-occurred with
w1 and w2. A popular variation (Yoshida et al. (2003), Lee (1999), and Schütze
and Pedersen (1997)) that incorporates this information is stated below:



\[ Cos(w_1, w_2) = \frac{\sum_{w \in C(w_1) \cup C(w_2)} \left( P(w|w_1) \times P(w|w_2) \right)}{\sqrt{\sum_{w \in C(w_1)} P(w|w_1)^2} \times \sqrt{\sum_{w \in C(w_2)} P(w|w_2)^2}} \tag{4} \]



C(x) is the set of words that co-occur (within a certain window) with the
word x in a corpus. P(x|y) is the probability that a particular co-occurrence
is composed of x and y, given that word y is one of the words in the
co-occurrence pair. It can be approximated by simple corpus counts. Once again,
the formula is the cosine of the angle between the word vectors $\vec{w}_1$ and $\vec{w}_2$, but
the word vectors incorporate the strength of association of the co-occurring
words with the target words. The component of $\vec{x}$ in a dimension (corresponding
to word y, say) is equal to the strength of association of y with x. Thus
the vectors corresponding to two words are closer together, and thereby get a
high distributional relatedness score, if they share many co-occurring words
and the co-occurring words have more or less the same strength of association
with the two target words. In the above formula, the conditional probability
of the co-occurring words given the target words is used as the strength of
association.
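
A minimal sketch of the weighted cosine of equation (4) follows. It assumes that P(w|target) is approximated by relative co-occurrence counts; the function names and the toy counts are hypothetical and serve only to illustrate the computation.

```python
import math
from collections import Counter

def conditional_probs(pair_counts, target):
    """Estimate P(w | target) from co-occurrence counts.

    `pair_counts[target]` is a Counter mapping each co-occurring word w
    to the number of times it co-occurred with `target`.
    """
    counts = pair_counts[target]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine_cp(pair_counts, w1, w2):
    """Cosine of the two conditional-probability vectors, as in equation (4)."""
    p1 = conditional_probs(pair_counts, w1)
    p2 = conditional_probs(pair_counts, w2)
    # Words that co-occur with only one target contribute zero to the numerator.
    dot = sum(p1[w] * p2[w] for w in p1.keys() & p2.keys())
    norm1 = math.sqrt(sum(p * p for p in p1.values()))
    norm2 = math.sqrt(sum(p * p for p in p2.values()))
    return dot / (norm1 * norm2)

# Hypothetical co-occurrence counts, for illustration only.
counts = {
    "bug": Counter({"crawl": 4, "small": 3, "squash": 3}),
    "insect": Counter({"crawl": 5, "small": 4, "woods": 1}),
    "car": Counter({"drive": 6, "road": 4}),
}
bug_insect = cosine_cp(counts, "bug", "insect")
bug_car = cosine_cp(counts, "bug", "car")
```

In this toy example bug and insect share most of their co-occurring words with similar strengths and so score high, while bug and car share none and score 0.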

The cosine is used, among others, by Schütze and Pedersen (1997) and
Yoshida et al. (2003), who suggest methods of automatically generating thesauri
from text corpora. Schütze and Pedersen (1997) use the Tipster category
B corpus (Harman, 1993) (450,000 unique terms) and the Wall Street Journal
to create a large but sparse co-occurrence matrix of 3,000 medium-frequency
words (frequency rank between 2,000 and 5,000). Latent semantic indexing
and singular value decomposition (see Schütze and Pedersen (1997) for details)
are used to reduce the dimensionality of the matrix and get for each term a
word vector of its 20 strongest co-occurrences. The cosine of a word vector
(say $\vec{w}_1$) with each of the other word vectors is calculated, and the top scores,
along with the words whose vectors generated them, are noted. These
words form the thesaurus entries for w1.

Yoshida et al. (2003) believe that words that are closely related for one 
person may be distant for another. They use around 40,000 HTML documents 
to generate personalized thesauri for six different people. The documents used to
create the thesaurus for a person are retrieved from the subject's home page
and by a web crawler that accesses linked documents. The authors also suggest
a root-mean-squared method to determine the similarity of two different
thesaurus entries for the same word.



3.1.2. Manhattan and Euclidean Distances 

Distance between two points (words) in multidimensional space can be calculated
using the Manhattan distance, a.k.a. L1 norm (denoted by L1), or the
Euclidean distance, a.k.a. L2 norm (denoted by L2). In the Manhattan distance
(5) (Dagan et al. (1997), Dagan et al. (1999), and Lee (1999)), the
disparity in strength of association of w1 and w2 with each word that they
co-occur with is summed. The greater the disparity in association, the greater
the distributional distance between the two words. The Euclidean distance (6)
(Lee (1999)) takes the square root of the sum of the squared disparities in
association to get the final distributional distance. Both the L1 norm and the
L2 norm give values in the range from 0 (zero distance or maximally related)
to infinity (maximally distant or unrelated).






\[ L_1(w_1, w_2) = \sum_{w \in C(w_1) \cup C(w_2)} \left| P(w|w_1) - P(w|w_2) \right| \tag{5} \]

\[ L_2(w_1, w_2) = \sqrt{\sum_{w \in C(w_1) \cup C(w_2)} \left( P(w|w_1) - P(w|w_2) \right)^2} \tag{6} \]



The above formulae use the conditional probability of the co-occurring words
given the target words as the strength of association. The distributional
relatedness of words may be found by taking the reciprocal of the distributional
distance or by a similar suitable method.
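
Equations (5) and (6) can be sketched as follows. The dictionaries of conditional probabilities are hypothetical values chosen for illustration, and a word that does not co-occur with a target is assumed to have probability 0 for that target.

```python
import math

def l1_distance(p1, p2):
    """Manhattan distance between two conditional-probability vectors, equation (5)."""
    words = p1.keys() | p2.keys()
    return sum(abs(p1.get(w, 0.0) - p2.get(w, 0.0)) for w in words)

def l2_distance(p1, p2):
    """Euclidean distance between two conditional-probability vectors, equation (6)."""
    words = p1.keys() | p2.keys()
    return math.sqrt(sum((p1.get(w, 0.0) - p2.get(w, 0.0)) ** 2 for w in words))

# Hypothetical values of P(w | target), each summing to 1.
bug = {"crawl": 0.5, "small": 0.25, "squash": 0.25}
insect = {"crawl": 0.5, "small": 0.25, "woods": 0.25}
car = {"drive": 0.5, "road": 0.5}

d1_close = l1_distance(bug, insect)
d1_far = l1_distance(bug, car)
d2_close = l2_distance(bug, insect)
```

Because these are distances, the pair with the smaller value (bug and insect here) is the more related one; a reciprocal or a similar transformation turns the distance into a relatedness score.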

Lee (1999) compared the ability of all three spatial metrics to determine
the probability of an unseen (not found in training data) word pair. The measures,
in order of their performance (from better to worse), were: L1 norm,
cosine, and L2 norm. Weeds (2003) determined the correlation of word pair
rankings as per a handful of distributional measures with human rankings
(the Miller and Charles (1991) word pairs). Using verb-object
pairs from the British National Corpus (BNC), she found the correlation of
L1 norm with human rankings to be 0.39.



3.2. Set Operations 



Distributional measures, as discussed earlier, aim to determine semantic similarity
(or relatedness) using words that co-occur with the target words. The
problem can be transformed to finding the similarity of two sets (W1 and W2,
say), where each set has as its members the co-occurring words of one of the
target words (w1 or w2, respectively). One can now use set operations such as the
Jaccard and Dice coefficients to determine the similarity of the two sets and
thereby, the semantic similarity of the target words.



\[ Jaccard(w_1, w_2) = \frac{|W_1 \cap W_2|}{|W_1 \cup W_2|} \tag{7} \]

\[ Dice(w_1, w_2) = \frac{2 \times |W_1 \cap W_2|}{|W_1| + |W_2|} \tag{8} \]



Both measures give scores in the range from 0 (unrelated) to 1 (maximally
related) and will rank word pairs identically.
Theorem 1. If the similarity of word pair one is less than the similarity of
word pair two, as determined by the Jaccard coefficient, then the similarity of
word pair one will be less than the similarity of word pair two, as determined
by the Dice coefficient.



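Proof.

Let x be the number of co-occurrences common to word pair one and y the
number of words that co-occur with just one of the two words in word pair
one. Let l and m be the corresponding values for word pair two.
Therefore,

\[ Jaccard(\text{pair one}) = \frac{x}{x + y} \tag{9} \]

\[ Dice(\text{pair one}) = \frac{2x}{2x + y} \tag{10} \]

\[ Jaccard(\text{pair two}) = \frac{l}{l + m} \tag{11} \]

\[ Dice(\text{pair two}) = \frac{2l}{2l + m} \tag{12} \]

Given,

\[ Jaccard(\text{pair one}) < Jaccard(\text{pair two}) \tag{13} \]

\[ \Rightarrow \frac{x}{x + y} < \frac{l}{l + m} \tag{14} \]

\[ \Rightarrow xl + xm < xl + yl \tag{15} \]

\[ \Rightarrow xm < yl \tag{16} \]

To prove,

\[ Dice(\text{pair one}) < Dice(\text{pair two}) \tag{17} \]

\[ \text{or, } \frac{2x}{2x + y} < \frac{2l}{2l + m} \tag{18} \]

\[ \text{or, } 4xl + 2xm < 4xl + 2yl \tag{19} \]

\[ \text{or, } xm < yl \tag{20} \]

which is true (from (16)). □
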

Thus, in terms of measuring distributional similarity/relatedness, the Jaccard
and Dice coefficients are identical in how they rank word pairs. Lee (1999)
shows that the Jaccard coefficient performs better than the L1 norm in an
unseen bigram probability estimation task.
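
The two set coefficients and the ranking result of Theorem 1 can be checked on small examples. The sketch below uses hypothetical co-occurrence sets; the identity Dice = 2 × Jaccard / (1 + Jaccard), which follows from equations (9) and (10), is an increasing function of the Jaccard score and is another way of seeing why the two rankings agree.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets, equation (7)."""
    return len(a & b) / len(a | b)

def dice(a, b):
    """Dice coefficient of two sets, equation (8)."""
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical co-occurrence sets for three target words.
bug = {"crawl", "small", "squash", "woods"}
insect = {"crawl", "small", "woods", "fly"}
car = {"drive", "road", "small"}

pairs = [(bug, insect), (bug, car), (insect, car)]
jaccard_scores = [jaccard(a, b) for a, b in pairs]
dice_scores = [dice(a, b) for a, b in pairs]

# Positions of the pairs when sorted by each score; the two orderings coincide.
jaccard_order = sorted(range(len(pairs)), key=lambda i: jaccard_scores[i])
dice_order = sorted(range(len(pairs)), key=lambda i: dice_scores[i])
```

Since one score determines the other, computing both for ranking purposes is redundant.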
3.2.1. Pseudo-Fuzzy Metrics

Simple set operations as stated above do not consider the strength of association
of the co-occurring word with the target words. The strength of association
can be incorporated into the metrics by considering the co-occurrence
sets to be pseudo-fuzzy. The degree of membership of each word in the
pseudo-fuzzy set corresponding to a target word is its strength of association
with the target word. We call the sets pseudo-fuzzy (and not fuzzy) because
the range of membership values is now dependent on the measure of association
used: 0 to 1 for conditional probability, and 0 to infinity for PMI (ignoring
negative values). Even though conditional probability has a range from 0 to 1 like
a standard fuzzy set membership function, the conditional probabilities of all
the words with respect to a particular target word sum up to 1. This need not
be (and usually is not) true of the membership values for a regular fuzzy set.
Use of conditional probability (denoted by CP) as the strength of association
and application of the Jaccard and Dice coefficients on the pseudo-fuzzy sets
results in the following formulae. Similar to the case of regular sets, it can
easily be shown that the Dice and Jaccard coefficients of pseudo-fuzzy sets
also rank word pairs identically.

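\[ Jaccard^{CP}(w_1, w_2) = \frac{\sum_{w \in C(w_1) \cup C(w_2)} \min(P(w|w_1), P(w|w_2))}{\sum_{w \in C(w_1) \cup C(w_2)} \max(P(w|w_1), P(w|w_2))} \tag{21} \]

\[ Dice^{CP}(w_1, w_2) = \frac{2 \times \sum_{w \in C(w_1) \cup C(w_2)} \min(P(w|w_1), P(w|w_2))}{\sum_{w \in C(w_1)} P(w|w_1) + \sum_{w \in C(w_2)} P(w|w_2)} \tag{22} \]

\[ = \frac{2 \times \sum_{w \in C(w_1) \cup C(w_2)} \min(P(w|w_1), P(w|w_2))}{1 + 1} \tag{23} \]

\[ = \sum_{w \in C(w_1) \cup C(w_2)} \min(P(w|w_1), P(w|w_2)) \tag{24} \]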

Observe that the special nature of the membership function reduces the Dice
coefficient to the simplified form (24), which is also the numerator
of the Jaccard coefficient. Since Dice and Jaccard are identical in terms of 
ranking word pairs, use of this simplified form is computationally optimal if 
one decides to use the Dice or Jaccard coefficient with conditional probability 
as the strength of association. 
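
The simplification from (22) to (24) can be verified numerically. The sketch below assumes hypothetical conditional-probability memberships that sum to 1 for each target word; the function names are made up for this example.

```python
def fuzzy_jaccard(p1, p2):
    """Jaccard coefficient on pseudo-fuzzy sets, equation (21)."""
    words = p1.keys() | p2.keys()
    num = sum(min(p1.get(w, 0.0), p2.get(w, 0.0)) for w in words)
    den = sum(max(p1.get(w, 0.0), p2.get(w, 0.0)) for w in words)
    return num / den

def fuzzy_dice(p1, p2):
    """Dice coefficient on pseudo-fuzzy sets, equation (22)."""
    words = p1.keys() | p2.keys()
    num = 2 * sum(min(p1.get(w, 0.0), p2.get(w, 0.0)) for w in words)
    return num / (sum(p1.values()) + sum(p2.values()))

def min_overlap(p1, p2):
    """Simplified form (24): the summed minimum membership."""
    words = p1.keys() | p2.keys()
    return sum(min(p1.get(w, 0.0), p2.get(w, 0.0)) for w in words)

# Hypothetical values of P(w | target); each dictionary sums to 1.
bug = {"crawl": 0.5, "small": 0.25, "squash": 0.25}
insect = {"crawl": 0.5, "small": 0.25, "woods": 0.25}

j = fuzzy_jaccard(bug, insect)
d = fuzzy_dice(bug, insect)
s = min_overlap(bug, insect)
```

The Dice value equals the summed minimum only because each membership vector sums to 1; with another strength of association, such as PMI, the full form (22) would have to be used.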

Dagan et al. (1995) use a weighted version of the Jaccard coefficient on 
pseudo-fuzzy sets with PMI as the strength of association. They do not pro- 
vide quantitative comparison with other distributional measures and do not 
derive their measure as shown above. Viewing co-occurrence information as
pseudo-fuzzy sets, which enables the use of any of the numerous set operations
to determine distributional similarity, is a novel approach. Part of our future
research is to determine how well such measures fare compared to the others.
3.3. Mutual Information-Based Measures

Hindle (1990) was one of the first to factor the strength of association of
co-occurring words into a distributional similarity measure. The hypothesis
is that the more similar the association of co-occurring words with the two
target words, the more semantically similar they are. Hindle used pointwise
mutual information (PMI) as the strength of association. Consider the nouns
n_j and n_k that exist as objects of verb v_i in different instances within a text
corpus. Hindle used formula (25) to determine the distributional similarity of
n_j and n_k solely from their occurrences as object of v_i. The minimum of the
two PMIs captures the similarity in the strength of association of v_i with each
of the two nouns. Note that in case of negative PMI values, the maximum
function captures the PMI which is lower in absolute value.

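\[ Hin_{obj}(v_i, n_j, n_k) = \begin{cases} \min(I(v_i, n_j), I(v_i, n_k)), & \text{if } I(v_i, n_j) > 0 \text{ and } I(v_i, n_k) > 0 \\ \left| \max(I(v_i, n_j), I(v_i, n_k)) \right|, & \text{if } I(v_i, n_j) < 0 \text{ and } I(v_i, n_k) < 0 \\ 0, & \text{otherwise} \end{cases} \tag{25} \]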

I(n,v) stands for the PMI (word association ratio, to be more precise) between
the words n and v. Hindle used an analogous formula to calculate the
distributional similarity (Hin_subj) using the subject-verb relation. The overall
distributional similarity between any two nouns is calculated by the formula
(26).

\[ Hin(n_1, n_2) = \sum_{i}^{N} \left( Hin_{obj}(v_i, n_1, n_2) + Hin_{subj}(v_i, n_1, n_2) \right) \tag{26} \]

The measure gives similarity scores from 0 (maximally dissimilar) to infinity
(maximally similar). Note that in Hindle's measure, the set of co-occurring 
words used is restricted to include only those words that have the same syn- 
tactic relation with both target words (either verb-object or verb-subject). 
This is therefore a measure of distributional similarity and not distributional 
relatedness. A form of Hindle's measure where all co-occurring words are 
used, making it a measure of distributional relatedness, is shown below: 



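\[ Hin_{rel}(w_1, w_2) = \sum_{w \in C(w_1) \cup C(w_2)} \begin{cases} \min(I(w, w_1), I(w, w_2)), & \text{if } I(w, w_1) > 0 \text{ and } I(w, w_2) > 0 \\ \left| \max(I(w, w_1), I(w, w_2)) \right|, & \text{if } I(w, w_1) < 0 \text{ and } I(w, w_2) < 0 \\ 0, & \text{otherwise} \end{cases} \tag{27} \]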

C(x) is the set of words that co-occur with word x.
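
The relatedness form of Hindle's measure (27) can be sketched as below. The PMI values are hypothetical, and it is an assumption of this sketch that a word co-occurring with only one of the two targets contributes nothing to the sum.

```python
def hindle_term(i1, i2):
    """Contribution of one co-occurring word, following the cases of (27)."""
    if i1 > 0 and i2 > 0:
        return min(i1, i2)
    if i1 < 0 and i2 < 0:
        return abs(max(i1, i2))  # the PMI that is lower in absolute value
    return 0.0

def hindle_rel(pmi1, pmi2):
    """Sum the contributions over words that co-occur with both targets.

    `pmi1` and `pmi2` map each co-occurring word to its PMI with the
    respective target word.
    """
    return sum(hindle_term(pmi1[w], pmi2[w]) for w in pmi1.keys() & pmi2.keys())

# Hypothetical PMI values, for illustration only.
doctor = {"patient": 3.0, "scalpel": 2.0, "tax": -1.0}
operate = {"patient": 2.5, "scalpel": 4.0, "tax": -0.5, "machine": 1.5}

score = hindle_rel(doctor, operate)
```

Here patient contributes 2.5, scalpel contributes 2.0, and tax, which has negative PMI with both targets, contributes 0.5.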

Lin (1998a) suggests a different measure derived from his information 
theoretic definition of similarity (Lin, 1998b). Further, he uses a broad set 
of syntactic relations apart from subject-verb and verb-object relations and
shows that using multiple relations is beneficial even by Hindle's measure.
He first extracts triples of the form (x,r,y) from the partially parsed text,
where the word x is related to y by the syntactic relation r. If I(x,r,y) is
the information contained in the proposition that the triple (x,r,y) occurred a
constant c times, then Lin defines the distributional similarity between two
words, w1 and w2, as follows:



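\[ Lin(w_1, w_2) = \frac{\sum_{(r,w) \in T(w_1) \cap T(w_2)} \left( I(w_1, r, w) + I(w_2, r, w) \right)}{\sum_{(r,w') \in T(w_1)} I(w_1, r, w') + \sum_{(r,w'') \in T(w_2)} I(w_2, r, w'')} \tag{28} \]
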

T(x) is the set of all word pairs (r,y) such that the pointwise mutual information
I(x,r,y) is positive. Note that this is different from Hindle (1990),
where cases of negative PMI were also considered. As mentioned
earlier, Church and Hanks (1989) show that it is hard to accurately predict
negative word association ratios with confidence. Thus, co-occurrence pairs
with negative PMI are ignored. The measure gives similarity scores from
0 (maximally dissimilar) to 1 (maximally similar).

Lin's measure distinguishes itself from that of Hindle in two respects.
Firstly, he normalizes the distributional similarity between two words (w1
and w2), determined by their PMI with common co-occurring words, by the
total PMI of w1 and w2 with the rest of the related words. This is a significant
improvement as now high PMI of the target words with shared co-occurring
words does not guarantee a high distributional similarity score. As an additional
requirement, the target words must have low PMI with words they do
not both co-occur with. The second difference in the two formulae is that
Hindle uses a minimum of the PMI between each of the target words and the
shared co-occurring word, while Lin uses the sum. Taking the sum has the
drawback of not penalizing for a mismatch in strength of co-occurrence, as
long as w1 and w2 both co-occur with a word. We suggest a new measure of
distributional similarity (denoted by Saif) which counters this but keeps the
normalizing factor of Lin's measure:

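\[ Saif(w_1, w_2) = \frac{2 \times \sum_{(r,w) \in T(w_1) \cap T(w_2)} \min(I(w_1, r, w), I(w_2, r, w))}{\sum_{(r,w') \in T(w_1)} I(w_1, r, w') + \sum_{(r,w'') \in T(w_2)} I(w_2, r, w'')} \tag{29} \]
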

The multiplication by two is done to get scores in the range of 0 to 1 (note that
the sum in Lin's formula was replaced by a min). The multiplication has no
effect on the relative ranking of word pairs by their similarities. Also notice
that, like Hindle's measure, both Lin's and ours are measures of distributional
similarity. Hindle (1990) used a portion of the Associated Press news stories 
(6 million words) to classify the nouns into semantically related classes. Lin 
(1998a) used his measure to generate a thesaurus from a 64-million-word 
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corpus of the Wall Street Journal, San Jose Mercury and AP Newswire. He 
also provides a framework for evaluating automatically generated thesauri by 
comparing them with WordNet-based and Roget-based thesauri. He shows 
that the thesaurus created with his measure is closer to the WordNet and 
Roget-based thesauri than that of Hindle. 
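
The contrast between summing and taking the minimum can be made concrete with
a small sketch. It assumes that each target word is represented by a map from
(relation, word) features to positive PMI values, and that Lin's measure has
its usual form (shared-feature PMI summed and divided by the total PMI of both
words). The function names and the PMI values are hypothetical and chosen
only for illustration.

```python
from typing import Dict, Tuple

Feature = Tuple[str, str]        # (syntactic relation, co-occurring word)
Profile = Dict[Feature, float]   # feature -> positive PMI with the target word


def lin_similarity(t1: Profile, t2: Profile) -> float:
    """Sum of PMI over shared features, normalized by the total PMI of both words."""
    shared = t1.keys() & t2.keys()
    numerator = sum(t1[f] + t2[f] for f in shared)
    denominator = sum(t1.values()) + sum(t2.values())
    return numerator / denominator if denominator else 0.0


def saif_similarity(t1: Profile, t2: Profile) -> float:
    """Same normalizer, but each shared feature contributes twice the smaller PMI."""
    shared = t1.keys() & t2.keys()
    numerator = 2 * sum(min(t1[f], t2[f]) for f in shared)
    denominator = sum(t1.values()) + sum(t2.values())
    return numerator / denominator if denominator else 0.0


# Hypothetical PMI values, invented for illustration only.
mango = {("adj", "ripe"): 4.0, ("adj", "juicy"): 6.0, ("obj-of", "eat"): 2.0}
banana = {("adj", "ripe"): 4.0, ("adj", "juicy"): 1.0, ("obj-of", "peel"): 3.0}

lin_score = lin_similarity(mango, banana)
saif_score = saif_similarity(mango, banana)
print(lin_score, saif_score)
```

With these toy values the mismatch in the strength of association with juicy
lowers the min-based score but not the sum-based one, while a word compared
with itself still scores 1 under both.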

3.3.1. Mutual Information-Based Spatial and Fuzzy Metrics
Variations of the spatial metrics (equations (4), (5), and (6)) that use
pointwise mutual information instead of conditional probability as the
strength of association are possible. Following are the formulae for mutual
information-based spatial metrics.

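\[
\mathit{Cos}^{MI}(w_1, w_2) =
  \frac{\sum_{w \in C(w_1) \cup C(w_2)} I(w,w_1) \times I(w,w_2)}
       {\sqrt{\sum_{w \in C(w_1)} I(w,w_1)^2} \times
        \sqrt{\sum_{w \in C(w_2)} I(w,w_2)^2}}
\tag{30}
\]

\[
L_1^{MI}(w_1, w_2) =
  \sum_{w \in C(w_1) \cup C(w_2)} \bigl| I(w,w_1) - I(w,w_2) \bigr|
\tag{31}
\]

\[
L_2^{MI}(w_1, w_2) =
  \sqrt{\sum_{w \in C(w_1) \cup C(w_2)} \bigl( I(w,w_1) - I(w,w_2) \bigr)^2}
\tag{32}
\]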

Use of pointwise mutual information as the strength of association in the 
fuzzy metrics (see equations (22) and (21)) discussed earlier results in the 
following: 

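\[
\mathit{Jaccard}^{MI}(w_1, w_2) =
  \frac{\sum_{w \in C(w_1) \cup C(w_2)} \min\bigl(I(w,w_1),\, I(w,w_2)\bigr)}
       {\sum_{w \in C(w_1) \cup C(w_2)} \max\bigl(I(w,w_1),\, I(w,w_2)\bigr)}
\tag{33}
\]

\[
\mathit{Dice}^{MI}(w_1, w_2) =
  \frac{2 \times \sum_{w \in C(w_1) \cup C(w_2)}
        \min\bigl(I(w,w_1),\, I(w,w_2)\bigr)}
       {\sum_{w \in C(w_1)} I(w,w_1) + \sum_{w \in C(w_2)} I(w,w_2)}
\tag{34}
\]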

Observe that Saif(w1, w2) (equation (29)) equates to DiceMI(w1, w2) if the
restriction to use only positive pointwise mutual information is lifted.
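
As a rough illustration of how these variants behave, the sketch below
computes cosine, L1, L2, Jaccard and Dice scores over PMI profiles. It assumes
the standard forms of these metrics with PMI as the strength of association,
and it treats a word that does not co-occur with a target as having strength
0; both the function names and that convention are assumptions made here, not
part of the original formulation.

```python
import math
from typing import Dict

Profile = Dict[str, float]  # co-occurring word -> PMI with the target word


def cos_mi(a: Profile, b: Profile) -> float:
    dot = sum(a.get(w, 0.0) * b.get(w, 0.0) for w in a.keys() | b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def l1_mi(a: Profile, b: Profile) -> float:
    return sum(abs(a.get(w, 0.0) - b.get(w, 0.0)) for w in a.keys() | b.keys())


def l2_mi(a: Profile, b: Profile) -> float:
    return math.sqrt(
        sum((a.get(w, 0.0) - b.get(w, 0.0)) ** 2 for w in a.keys() | b.keys())
    )


def jaccard_mi(a: Profile, b: Profile) -> float:
    words = a.keys() | b.keys()
    smaller = sum(min(a.get(w, 0.0), b.get(w, 0.0)) for w in words)
    larger = sum(max(a.get(w, 0.0), b.get(w, 0.0)) for w in words)
    return smaller / larger if larger else 0.0


def dice_mi(a: Profile, b: Profile) -> float:
    words = a.keys() | b.keys()
    smaller = sum(min(a.get(w, 0.0), b.get(w, 0.0)) for w in words)
    total = sum(a.values()) + sum(b.values())
    return 2 * smaller / total if total else 0.0


# Hypothetical PMI profiles for two target words.
p1 = {"x": 3.0, "y": 4.0}
p2 = {"x": 3.0, "z": 4.0}
print(cos_mi(p1, p2), l1_mi(p1, p2), l2_mi(p1, p2),
      jaccard_mi(p1, p2), dice_mi(p1, p2))
```

The spatial L1 and L2 values are distances (larger means less alike), whereas
the cosine, Jaccard and Dice values are similarities.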

3.4. Relative Entropy-Based Measures 

3.4.1. Kullback-Leibler divergence 

Given two probability mass functions p(x) and q(x), their relative entropy
D(p||q) is:

\[
D(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}
\quad \text{for } q(x) \neq 0
\tag{35}
\]

Intuitively, if p(x) is the accurate probability mass function corresponding
to a random variable X, D(p||q) is the information lost on approximating p(x)
by q(x). In other words, D(p||q) is indicative of how different the two
distributions are. Relative entropy is also called the Kullback-Leibler
divergence or the Kullback-Leibler distance (denoted by KLD).

Pereira et al. (1993) and Dagan et al. (1994) point out that words have
probabilistic distributions with respect to neighboring syntactically related
words. For example, there exists a certain probabilistic distribution
(d1 = P(v|n1), say) of a particular noun n1 being the object of any verb v.
This distribution can be estimated from corpus counts of parsed or chunked
text. Let d2 (P(v|n2)) be the corresponding distribution for noun n2. These
distributions (d1 and d2) define the contexts of the two nouns (n1 and n2,
respectively). As per the distributional hypothesis (Harris, 1968), the more
similar these contexts are, the more semantically similar n1 and n2 are. Thus
the Kullback-Leibler distance between the two distributions is indicative of
the semantic distance between the nouns n1 and n2.



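\[
\begin{aligned}
\mathit{KLD}(n_1, n_2) &= D(d_1 \,\|\, d_2) \\
  &= \sum_{v \in Vb} P(v|n_1) \log \frac{P(v|n_1)}{P(v|n_2)}
     \quad \text{for } P(v|n_2) \neq 0 \\
  &= \sum_{v \in Vb'(n_1) \cup Vb'(n_2)} P(v|n_1)
     \log \frac{P(v|n_1)}{P(v|n_2)}
     \quad \text{for } P(v|n_2) \neq 0
\end{aligned}
\tag{36}
\]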



where Vb is the set of all verbs and Vb'(x) is the set of verbs that have x
as the object. The distributional similarity is determined by taking the
reciprocal of the Kullback-Leibler distance or by a similar suitable method.
Note that the set of co-occurring words used is restricted to include only
verbs that each have the same syntactic relation (verb-object) with both
target nouns. This is therefore a measure of distributional similarity and
not distributional relatedness.

It should be noted that the verb-object relationship is not inherent to the
measure and that one or more of any other syntactic relations may be used.
The distributional relatedness may even be determined using all words
co-occurring with the target words. Thus a more generic expression of the
Kullback-Leibler divergence is as follows:



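\[
\begin{aligned}
\mathit{KLD}(w_1, w_2) &= D(d_1 \,\|\, d_2) \\
  &= \sum_{w \in V} P(w|w_1) \log \frac{P(w|w_1)}{P(w|w_2)}
     \quad \text{for } P(w|w_2) \neq 0 \\
  &= \sum_{w \in C(w_1) \cup C(w_2)} P(w|w_1)
     \log \frac{P(w|w_1)}{P(w|w_2)}
     \quad \text{for } P(w|w_2) \neq 0
\end{aligned}
\tag{37}
\]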
V is the vocabulary (all the words found in a corpus). C(x), as mentioned
earlier, is the set of words occurring (within a certain window) with word x.
The inverse of the distributional distance calculated above yields the
distributional relatedness of w1 and w2.

It should be noted that the Kullback-Leibler distance is not symmetric, that
is, the distance from w1 to w2 is not necessarily, and indeed is unlikely to
be, the same as the distance from w2 to w1. This asymmetry is
counter-intuitive to the general notion of semantic similarity of words,
although Weeds (2003) has argued in favor of asymmetric measures. Further, it
is very likely that there are instances in which P(w1|v) is greater than 0
for a particular verb v, while due to data sparseness or grammatical and
semantic constraints, the training data has no sentence where v has the
object w2. This makes P(w2|v) equal to 0 and the ratio of the two
probabilities infinite. Kullback-Leibler divergence is not defined in such
cases, but approximations may be made by considering smoothed values for the
denominator.
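
Both properties, the asymmetry and the zero-denominator problem, can be seen
in a minimal sketch. The distributions below are hypothetical, the choice to
return infinity when the denominator is zero is a convention adopted here for
illustration (the divergence is strictly undefined), and the interpolation
with a uniform distribution is only one of many possible smoothing schemes.

```python
import math
from typing import Dict, Iterable

Dist = Dict[str, float]  # co-occurring word -> conditional probability


def kld(p: Dist, q: Dist) -> float:
    """D(p || q); returns infinity when q assigns 0 where p does not."""
    total = 0.0
    for w, pw in p.items():
        if pw == 0.0:
            continue  # 0 * log(0 / q) is taken to be 0
        qw = q.get(w, 0.0)
        if qw == 0.0:
            return math.inf
        total += pw * math.log(pw / qw)
    return total


def smooth(p: Dist, vocabulary: Iterable[str], lam: float = 0.1) -> Dist:
    """Interpolate p with a uniform distribution (an assumed smoothing scheme)."""
    vocab = list(vocabulary)
    uniform = 1.0 / len(vocab)
    return {w: (1.0 - lam) * p.get(w, 0.0) + lam * uniform for w in vocab}


# Hypothetical verb distributions for three nouns.
d1 = {"eat": 0.5, "peel": 0.5}
d2 = {"eat": 0.9, "peel": 0.1}
d3 = {"eat": 1.0}  # never seen as the object of "peel"

print(kld(d1, d2), kld(d2, d1))   # the two directions differ
print(kld(d1, d3))                # unseen event in the denominator
print(kld(d1, smooth(d3, ["eat", "peel"])))  # finite after smoothing
```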

Pereira et al. (1993) use relative entropy to create clusters of nouns from
verb-object pairs corresponding to the thousand most frequent nouns in the
Grolier's Encyclopedia, June 1991 version (10 million words). Dagan et al.
(1994) use Kullback-Leibler distance to estimate the probabilities of bigrams
that were not seen in a text corpus. They point out that a significant number
of possible bigrams are not seen in any given text corpus. The probabilities
of such bigrams may be determined by taking a weighted average of the
probabilities of bigrams composed of distributionally similar words. Use of
Kullback-Leibler distance as the semantic distance metric yielded a 20%
improvement in perplexity on the Wall Street Journal and dictation corpora
provided by ARPA's HLT program (Paul, 1991).

The use of distributionally similar words to estimate unseen bigram
probabilities will likely lead to erroneous results in the case of
less-preferred and strongly-preferred collocations (word pairs). Inkpen and
Hirst (2002) point
out that even though words like task and job are semantically very similar, 
the collocations they form with other words may have varying degrees of 
usage. While daunting task is a strongly-preferred collocation, daunting job 
is rarely used. Thus using the probability of one bigram to estimate that of 
another will not be beneficial in such cases. 

3.4.2. α Skew Divergence

α skew divergence (ASD) is a slight modification of the Kullback-Leibler
divergence that obviates the need for smoothed probabilities. It has the
following formula:

\[
\mathit{ASD}(w_1, w_2) =
  \sum_{w \in C(w_1) \cup C(w_2)} P(w|w_1)
  \log \frac{P(w|w_1)}{\alpha P(w|w_2) + (1 - \alpha) P(w|w_1)}
\tag{38}
\]
α is a parameter that may be varied but is usually set to 0.99. Note that
the denominator within the logarithm is never zero with a non-zero numerator.
Also, the measure retains the asymmetric nature of the Kullback-Leibler
divergence.

Lee (2001) shows that α skew divergence performs better than Kullback-Leibler
divergence in estimating word co-occurrence probabilities. Weeds (2003)
achieves a correlation of 0.48 and 0.26 with human judgment on the Miller and
Charles word pairs using ASD(w1, w2) and ASD(w2, w1), respectively.
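
A minimal sketch of the skew divergence follows, assuming the form in which
the second distribution is mixed with a small amount of the first,
D(d1 || α d2 + (1 − α) d1), as in Lee (2001). The distributions are
hypothetical. Because the mixed denominator is at least (1 − α) times the
numerator, each log ratio is at most log(1/(1 − α)), which the code checks.

```python
import math
from typing import Dict

Dist = Dict[str, float]  # co-occurring word -> conditional probability


def skew_divergence(p: Dist, q: Dist, alpha: float = 0.99) -> float:
    """D(p || alpha * q + (1 - alpha) * p), defined even where q is zero."""
    total = 0.0
    for w, pw in p.items():
        if pw > 0.0:
            mixed = alpha * q.get(w, 0.0) + (1.0 - alpha) * pw
            total += pw * math.log(pw / mixed)
    return total


# Hypothetical distributions; d3 gives zero probability to "peel".
d1 = {"eat": 0.5, "peel": 0.5}
d3 = {"eat": 1.0}

forward = skew_divergence(d1, d3)
backward = skew_divergence(d3, d1)
upper_bound = math.log(1.0 / (1.0 - 0.99))
print(forward, backward, upper_bound)
```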

3.4.3. Jensen-Shannon Divergence 

A relative entropy-based measure that overcomes the drawback of asymme- 
try in Kullback-Leibler divergence is the Jensen-Shannon divergence a.k.a. 
total divergence to the average a.k.a. information radius. It is denoted by 
JSD and has the following formula: 

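\[
\mathit{JSD}(w_1, w_2) =
  D\bigl(d_1 \,\|\, \tfrac{1}{2}(d_1 + d_2)\bigr)
  + D\bigl(d_2 \,\|\, \tfrac{1}{2}(d_1 + d_2)\bigr)
\tag{39}
\]

\[
= \sum_{w \in C(w_1) \cup C(w_2)} \left(
    P(w|w_1) \log \frac{P(w|w_1)}{\tfrac{1}{2}\bigl(P(w|w_1) + P(w|w_2)\bigr)}
  + P(w|w_2) \log \frac{P(w|w_2)}{\tfrac{1}{2}\bigl(P(w|w_1) + P(w|w_2)\bigr)}
  \right)
\tag{40}
\]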



The Jensen-Shannon divergence is the sum of the Kullback-Leibler divergences
between each of the individual distributions d1 and d2 and the average
distribution (½(d1 + d2)). Further, it can be shown that the Jensen-Shannon
divergence avoids the problem of zero denominator found in Kullback-Leibler
divergence. The Jensen-Shannon divergence is therefore always well defined
and, like α skew divergence, obviates the need for smoothed estimates.

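The Kullback-Leibler divergence, α skew divergence, and Jensen-Shannon
divergence all give distributional distance scores that start at 0 (maximally
similar/related) and grow as the distributions diverge. The Kullback-Leibler
divergence is unbounded above (completely dissimilar/unrelated); for α < 1
the α skew divergence cannot exceed log(1/(1 − α)), and the Jensen-Shannon
divergence as defined in equation (39) cannot exceed 2 log 2.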
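
The following sketch computes the Jensen-Shannon divergence as the sum of the
two divergences to the average distribution. The distributions are
hypothetical. The example also shows the symmetry of the measure and the value
it takes for two distributions with no co-occurring words in common.

```python
import math
from typing import Dict

Dist = Dict[str, float]  # co-occurring word -> conditional probability


def jsd(p: Dist, q: Dist) -> float:
    """D(p || m) + D(q || m), where m is the average of p and q."""
    total = 0.0
    for w in p.keys() | q.keys():
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        mean = (pw + qw) / 2.0
        if pw > 0.0:
            total += pw * math.log(pw / mean)
        if qw > 0.0:
            total += qw * math.log(qw / mean)
    return total


# Hypothetical distributions.
d1 = {"eat": 0.5, "peel": 0.5}
d3 = {"eat": 1.0}
disjoint_a = {"x": 1.0}
disjoint_b = {"y": 1.0}

print(jsd(d1, d3), jsd(d3, d1))        # identical in both directions
print(jsd(disjoint_a, disjoint_b))     # no shared co-occurrences
```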

3.5. Co-occurrence Retrieval Models 

The distributional measures suggested by Weeds (2003) are based on the
notion of substitutability: the more appropriate it is to substitute word w1
in place of word w2 in a suitable natural language task, the more
semantically similar they are. The natural language task she focuses on is
co-occurrence retrieval (the retrieval of words that co-occur with a target
word from text), and depending on the definition of appropriate she suggests
six different distributional measures called the co-occurrence retrieval
models (CRMs). Let N1 be the set of co-occurrences of w1 retrieved from a
text corpus and N2 that of w2. In order to determine how appropriate it is to
substitute w1 in place of w2 we have to decide how important it is to
retrieve as many of the co-occurrences listed in N2 as possible (recall,
denoted by R) and how important it is not to retrieve co-occurrences that are
not listed in N2 (precision, denoted by P). Thus Weeds' distributional
measures have a precision component and a recall component. The final score
is a weighted sum of the precision, recall and standard F measure (see
equation (41)). The weights determine the importance of precision and recall
and are determined empirically. If precision and recall are equally
important, then we get a symmetric measure which gives the same score to the
distributional similarity of w1 with w2 and of w2 with w1. Otherwise, we get
an asymmetric measure which assigns different similarities to the two cases.
As substitutability is defined as a measure of distributional similarity,
metrics such as precision and recall, which quantify how good the
substitution is, are used to calculate the distributional similarity.



\[
\mathit{CRM}(w_1, w_2) =
  \gamma \left[ \frac{2 \times P \times R}{P + R} \right]
  + (1 - \gamma) \bigl[ \beta [P] + (1 - \beta) [R] \bigr]
\tag{41}
\]

γ and β are tuned parameters that lie between 0 and 1.

Weeds argues that the asymmetry in substitutability is intuitive as in many 
cases it may be okay to substitute a word, say dog, with another, say animal, 
but the reverse is not likely to be acceptable as often. Since substitutabil- 
ity is a measure of semantic similarity, she believes that distributional sim- 
ilarity between two words should reflect this property as well. Hence, like 
the Kullback-Leibler divergence, all her distributional similarity models are 
inherently asymmetric. 
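
How the two parameters trade precision against recall can be seen in a small
sketch of the combination step. It assumes that precision and recall have
already been computed by one of the CRMs; the function name and the example
values are hypothetical.

```python
def crm_score(precision: float, recall: float, gamma: float, beta: float) -> float:
    """Weighted sum of the F measure and a beta-weighted mix of P and R."""
    if precision + recall > 0.0:
        f_measure = 2.0 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return gamma * f_measure + (1.0 - gamma) * (beta * precision + (1.0 - beta) * recall)


# Hypothetical precision and recall of retrieving w2's co-occurrences with w1.
p, r = 0.8, 0.4

print(crm_score(p, r, gamma=1.0, beta=0.5))   # pure F measure
print(crm_score(p, r, gamma=0.0, beta=1.0))   # pure precision
print(crm_score(p, r, gamma=0.0, beta=0.0))   # pure recall
```

Swapping the roles of the two words swaps precision and recall, so the score
is unchanged when γ = 1 or β = 0.5 and differs otherwise.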

A word's co-occurrence information may be specified by the set of
co-occurring words alone, or by specifying the strength of the co-occurrences
as well. This strength may be captured by a suitable measure of word
association such as conditional probability or pointwise mutual information
between the co-occurring words and the target words. Also, the difference in
the strength of co-occurrence may or may not be used to penalize the
substitutability of one word for another. Weeds (2003) provides six distinct
formulae for precision and recall, depending on the strength of co-occurrence
and the penalty for differences in strength of association.

The precision (or recall) can be considered as the product of a core
precision (or recall) formula (denoted by core) and a penalty function
(denoted by penalty). The CRMs that use simple counts of the common
co-occurrences in N1 and N2, and not the strength of associations, as core
precision and recall values are called type-based CRMs (denoted by the
superscript type). The CRMs that use conditional probabilities of the common
co-occurrences in N1 and N2 with the target words as core precision and
recall values are called token-based CRMs (denoted by the superscript token).
The CRMs that use pointwise mutual information of the common co-occurrences
in N1 and N2 with the target words as core precision and recall values are
called mutual information-based CRMs (denoted by the superscript mi). The
core precision and recall formulae for type, token and mutual
information-based CRMs are listed below:



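\[
\mathit{core}_P^{type}(w_1, w_2) = \frac{|C(w_1) \cap C(w_2)|}{|C(w_1)|}
\tag{42}
\]
\[
\mathit{core}_R^{type}(w_1, w_2) = \frac{|C(w_1) \cap C(w_2)|}{|C(w_2)|}
\tag{43}
\]
\[
\mathit{core}_P^{token}(w_1, w_2) = \sum_{w \in C(w_1) \cap C(w_2)} P(w|w_1)
\tag{44}
\]
\[
\mathit{core}_R^{token}(w_1, w_2) = \sum_{w \in C(w_1) \cap C(w_2)} P(w|w_2)
\tag{45}
\]
\[
\mathit{core}_P^{mi}(w_1, w_2) = \sum_{w \in C(w_1) \cap C(w_2)} I(w, w_1)
\tag{46}
\]
\[
\mathit{core}_R^{mi}(w_1, w_2) = \sum_{w \in C(w_1) \cap C(w_2)} I(w, w_2)
\tag{47}
\]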



The CRMs that do not penalize difference in strength of co-occurrence are 
called additive CRMs (denoted by the subscript add). The CRMs that do pe- 
nalize are called difference-weighted CRMs (subscript dw). The penalty is a 
conditional probability-based function (48, 49) for the token- and type-based 
CRMs, and a mutual information-based function (50, 51) for the mutual 
information-based CRM. 



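\[
\mathit{penalty}_P^{type} = \mathit{penalty}_P^{token}
  = \frac{\min\bigl(P(w|w_1),\, P(w|w_2)\bigr)}{P(w|w_1)}
\tag{48}
\]
\[
\mathit{penalty}_R^{type} = \mathit{penalty}_R^{token}
  = \frac{\min\bigl(P(w|w_1),\, P(w|w_2)\bigr)}{P(w|w_2)}
\tag{49}
\]
\[
\mathit{penalty}_P^{mi}
  = \frac{\min\bigl(I(w,w_1),\, I(w,w_2)\bigr)}{I(w,w_1)}
\tag{50}
\]
\[
\mathit{penalty}_R^{mi}
  = \frac{\min\bigl(I(w,w_1),\, I(w,w_2)\bigr)}{I(w,w_2)}
\tag{51}
\]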

The precision and recall of additive and difference-weighted CRMs are listed
in the appendix.
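
As a rough illustration of the type-based and token-based core values only
(the penalty functions and the full additive and difference-weighted formulae
are not covered), the sketch below assumes that each target word is described
by a map from co-occurring words to conditional probabilities. The function
names and the probabilities are hypothetical.

```python
from typing import Dict, Tuple

Dist = Dict[str, float]  # co-occurring word -> conditional probability


def core_type(c1: Dist, c2: Dist) -> Tuple[float, float]:
    """(precision, recall) from counts of shared co-occurring words."""
    shared = c1.keys() & c2.keys()
    return len(shared) / len(c1), len(shared) / len(c2)


def core_token(c1: Dist, c2: Dist) -> Tuple[float, float]:
    """(precision, recall) from conditional probabilities of shared words."""
    shared = c1.keys() & c2.keys()
    return sum(c1[w] for w in shared), sum(c2[w] for w in shared)


# Hypothetical co-occurrence distributions of two nouns.
n1 = {"eat": 0.5, "peel": 0.3, "ripen": 0.2}
n2 = {"eat": 0.6, "grow": 0.4}

print(core_type(n1, n2))
print(core_token(n1, n2))
```

The type-based values ignore how strongly the shared word is associated with
each noun, whereas the token-based values reflect it.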

Weeds (2003) extracted verb-object pairs of 2,000 nouns from the British 
National Corpus (BNC). The verbs related to the target words by the verb- 
object relation were used. Thus each of the co-occurring verbs is related to the 
target nouns by the same syntactic relation and therefore the measures capture 
distributional similarity, not relatedness. Correlation with human judgment 
(Miller and Charles word pairs) showed that difference-weighted (0.61) and 
additive mutual information-based measures (0.62) performed far better than 
the rest of the CRMs.

4. Discussion and Analysis of Distributional Measures

The previous section described numerous distributional measures. Variations 
of the measures are possible depending on certain general properties of a dis- 
tributional measure. This section discusses a few of the important properties 
along with an analysis of their effect in assigning semantic relatedness. 

4.1. Simple Co-occurrences vs Syntactically Related Words 

Harris (1968), one of the early proponents of the distributional hypothesis,
used syntactically related words to represent the context of a word. However,
the strength of association of any word appearing in the context of the
target words may be used to determine their distributional similarity. Dagan
et al. (1997), Lee (1999), and Weeds (2003) represent the context of a noun
with verbs whose object it is (a single syntactic relation); Hindle (1990)
represents the context of a noun with verbs with which it shares the
verb-object or subject-verb relation; and Lin (1998a) uses words related to a
noun by any of many pre-decided syntactic relations to determine
distributional similarity. Schütze and Pedersen (1997) and Yoshida et al.
(2003) use all co-occurring words in a pre-decided window size. Although Lin
(1998a) shows that using multiple syntactic relations is more beneficial than
using just one, there exist no published results on whether using only
syntactically related words (as compared to all co-occurrences) improves or
worsens the quality of semantic similarity assignment.

Use of syntactically related words requires chunking or parsing the data.
Once the data is suitably parsed, the computational cost of such methods is
lower because distributional similarity is determined with far fewer words.

4.1.1. Use of Multiple Syntactic Relations 

Lin (1998a) used a subset of the words that co-occurred with the target words
to determine their distributional similarity. Only those co-occurrences that
are syntactically related to the target words (by any relation in a
pre-decided list) are chosen. Once this restricted set of co-occurrences is
determined, distributional similarity is calculated by formula (28) shown
earlier. Observe that the formula does not distinguish between the
co-occurrences related by different syntactic relations. An alternative is to
calculate a distributional similarity value using each of the syntactic
relations individually and then determine the overall distributional
similarity from these results. The overall distributional similarity may be
as simple as the average (see (52)) or the maximum (see (53)) of the
individual similarity results. The two choices are justified in the two
paragraphs that follow the formulae, respectively.



Distributional Measures as Proxies for Semantic Relatedness 21 



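\[
\mathit{Sim}_{overallAvg}(w_1, w_2) = \frac{1}{N} \bigl(
  \mathit{Sim}_{r_1}(w_1, w_2) + \mathit{Sim}_{r_2}(w_1, w_2) + \ldots
  + \mathit{Sim}_{r_N}(w_1, w_2) \bigr)
\tag{52}
\]

\[
\mathit{Sim}_{overallMax}(w_1, w_2) = \max \bigl(
  \mathit{Sim}_{r_1}(w_1, w_2),\, \mathit{Sim}_{r_2}(w_1, w_2),\, \ldots,\,
  \mathit{Sim}_{r_N}(w_1, w_2) \bigr)
\tag{53}
\]

where N is the total number of syntactic relations considered and,

\[
\mathit{Sim}_{r_i}(w_1, w_2) =
  \frac{\sum_{(r_i,w) \in T(w_1) \cap T(w_2)}
        \bigl( I(w_1,r_i,w) + I(w_2,r_i,w) \bigr)}
       {\sum_{(r_i,w') \in T(w_1)} I(w_1,r_i,w')
        + \sum_{(r_i,w'') \in T(w_2)} I(w_2,r_i,w'')}
\tag{54}
\]

where r_i is a particular syntactic relation.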

Consider the scenario where word w' has a strong word association ratio
(large MI value) with w1 but does not co-occur with w2. The large MI value is
added to the denominator as per Lin's measure (28). This results in a low
distributional similarity value. However, a number of words are considered
semantically related even though there exist words (exclusive co-occurrences,
say) that have strong word association ratios with one or the other target
word but not both. A mark of semantically related words is the presence of a
number of common co-occurring words with which they are both strongly
associated. One or a few strong co-occurrences of a target word that do not
co-occur with the other target word do not imply that the target words are
semantically unrelated. For example, consider the rather similar pair of
nouns bananas and mangoes. The adjective juicy is likely to have a large
association ratio with mangoes but not with bananas. The large MI value of
mangoes and juicy may lead to an excessively low distributional similarity
value as per Lin's measure (28). Averaging the different distributional
similarity values (as in (52)) calculated from individual syntactic
relations, instead of using Lin's original method, moderates the strongly
negative effect of such exclusive co-occurrences by restricting it to a
particular syntactic relation (in this case, adjective-noun). It should be
noted that the disparity in the strength of association of mangoes and juicy
versus bananas and juicy is useful in bringing out the differences between
mango and banana, which may be used to determine that mango and orange are
more semantically related than mango and banana. However, as pointed out
earlier, we do not want a strong co-occurrence to have an adverse effect on
the estimation of distributional similarity in all other cases.

Taking the maximum of the individual distributional similarity values (53)
takes the aforementioned idea one step further and is grounded in the
following hypothesis:

Different syntactic relations are accurate predictors of the semantic
similarity for different pairs of words.

For example, fruits tend to have strong word associations with adjectives
like sweet, bitter, ripe and juicy, and low association values with verbs that
they are related to by the subject-verb relation. For example, consider the
sentences:

the ripe mango fell to the ground 
the ripe plum fell to the ground 

The words ripe and mango are related by the adjective-noun relation and are
likely to have a large value of association. On the other hand, mango and
fell, which are the subject and verb, respectively, are likely to have a low
measure of association because almost anything can fall. The adjective-noun
relation is thus expected to yield a higher distributional similarity value
than the subject-verb relation. Employing equation (53) in this case will
mean that co-occurrences related to the target words by the adjective-noun
relation will be used to determine the distributional similarity while all
other co-occurrences will be ignored. Thus only the relation that has the
most strongly associated co-occurrences is used to determine the
distributional similarity, as these co-occurrences are expected to be the
best predictors of semantic similarity. A measure where other sets of
co-occurrences, which are weak predictors of semantic similarity, are allowed
to influence the result may cause more harm than benefit. The argument can
also be turned on its head: target words predicted to be strongly
distributionally similar by two or more syntactic relations should be
assigned higher distributional similarity values than those predicted by just
one. The maximum method loses this information.
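
The difference between pooling all relations, averaging per-relation scores
and taking the maximum can be sketched as follows. The sketch assumes Lin's
measure in its usual form, applied once per relation, and a profile that maps
each relation to the PMI values of the words related by it. All names and
values are hypothetical.

```python
from typing import Dict

RelationProfile = Dict[str, float]          # co-occurring word -> positive PMI
Profile = Dict[str, RelationProfile]        # syntactic relation -> its words


def lin_for_relation(f1: RelationProfile, f2: RelationProfile) -> float:
    """Lin-style similarity restricted to a single syntactic relation."""
    shared = f1.keys() & f2.keys()
    numerator = sum(f1[w] + f2[w] for w in shared)
    denominator = sum(f1.values()) + sum(f2.values())
    return numerator / denominator if denominator else 0.0


def per_relation(t1: Profile, t2: Profile) -> Dict[str, float]:
    relations = sorted(t1.keys() | t2.keys())
    return {r: lin_for_relation(t1.get(r, {}), t2.get(r, {})) for r in relations}


def pooled(t1: Profile, t2: Profile) -> float:
    """Single score over all relations at once, with features kept distinct."""
    flat1 = {(r, w): v for r, ws in t1.items() for w, v in ws.items()}
    flat2 = {(r, w): v for r, ws in t2.items() for w, v in ws.items()}
    shared = flat1.keys() & flat2.keys()
    numerator = sum(flat1[f] + flat2[f] for f in shared)
    denominator = sum(flat1.values()) + sum(flat2.values())
    return numerator / denominator if denominator else 0.0


# Hypothetical PMI values: "juicy" is an exclusive co-occurrence of mango.
mango = {"adj": {"ripe": 4.0, "juicy": 6.0}, "subj-of": {"fall": 1.0}}
banana = {"adj": {"ripe": 4.0}, "subj-of": {"fall": 1.0}}

scores = per_relation(mango, banana)
overall_avg = sum(scores.values()) / len(scores)
overall_max = max(scores.values())
print(scores, overall_avg, overall_max, pooled(mango, banana))
```

With these toy values the exclusive co-occurrence lowers the pooled score the
most, the average less so, and the maximum not at all.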

Part of our future work will be to determine whether calculating individual
similarity values from different syntactic relations and then arriving at the
final similarity is closer to human judgment. Also, as pointed out, both the
average and the maximum approaches have their advantages and disadvantages.
It will be interesting to determine which method gives semantic similarity
values closer to the human notion of semantic similarity.

4.2. Compositionality

The various measures of distributional similarity may be divided into two
kinds as per their composition. In certain measures each co-occurring word
contributes some finite calculable distributional distance between the target
words. The final score of distributional distance is the sum of these
contributions. We will call such measures compositional measures. The
relative entropy-based measures, L1 norm and L2 norm fall in this category.
On the other hand, the cosine measure along with Hindle's and Lin's mutual
information-based measures belong to the category of what we call
non-compositional measures. Each co-occurring word shared by both target
words contributes a score to the numerator and the denominator. Words that
co-occur with just one of the two target words contribute scores only to the
denominator. The ratio is calculated once all co-occurring words are
considered. Thus the distributional distance contributed by individual
co-occurrences is not calculable and the final semantic distance cannot be
broken down into compositional distances contributed by each of the
co-occurrences.

It must be noted that it is not clear as to which of the two kinds of mea- 
sures (compositional or non-compositional) resembles human judgment more 
closely and how much they differ in their ranking of word pairs. Our future 
work aims to determine this. 

4.2.1. Primary Compositional Measures

The compositional measures of distributional similarity (or relatedness)
capture the contribution to distance between the target words (w1 and w2) due
to a co-occurring word by three primary mathematical manipulations of the
co-occurrence distributions (d1 and d2): the difference, denoted by Dif (as
in L1 norm), division, denoted by Div (as in the relative entropy-based
measures) and product, denoted by Pdt (as in the conditional probability or
mutual information-based cosine method). We will call the three types of
compositional measures primary compositional measures (PCM). Their form is
depicted below:

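\[
\mathit{Dif} = \sum_{w \in C(w_1) \cup C(w_2)}
  \bigl| P(w|w_1) - P(w|w_2) \bigr|
\tag{55}
\]

\[
\mathit{Div} = \sum_{w \in C(w_1) \cup C(w_2)}
  \left| \log \frac{P(w|w_1)}{P(w|w_2)} \right|
\tag{56}
\]

\[
\mathit{Pdt} = \sum_{w \in C(w_1) \cup C(w_2)}
  \frac{P(w|w_1) \times P(w|w_2)}{\text{Scaling Factor}}
\tag{57}
\]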

Observe that by taking absolute values in expressions (55) and (56), the
variation in the distributions for different co-occurring words has an
additive effect and not one of cancellation. This corresponds to our
distributional hypothesis: the greater the disparity in distributions, the
greater the semantic distance between the target words. The product form (57)
also achieves this and is based on the theorem:

The product of any two numbers will always be less than or equal to the 
square of their average. 

In other words, the closer two numbers are to each other in value, the higher
is the ratio of their product to a suitable scaling factor (for example, the
square of their average). Note that the difference and division measures give
higher values when there is large disparity between the strength of
association of co-occurring words with the target words. They are therefore
measures of distributional distance and not distributional similarity. The
product method gives higher values when the strengths of association are
closer, and is a measure of distributional relatedness.
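
The theorem follows from a one-line identity. As a short derivation, writing
a and b for the two strengths of association (symbols introduced here only
for illustration):

\[
\left( \frac{a + b}{2} \right)^{2} - ab
  = \frac{a^{2} - 2ab + b^{2}}{4}
  = \left( \frac{a - b}{2} \right)^{2} \geq 0
\]

Hence, whenever a + b is non-zero, the ratio of ab to the square of the
average is at most 1, with equality exactly when a = b, and the ratio falls
as the gap between a and b widens for a fixed sum.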

Although all three methods seem intuitive, each produces different
distributional similarity values and, more importantly, given a set of word
pairs, each is likely to rank them differently. For example, consider the
division and difference expressions applied to word pairs (w1, w2) and
(w3, w4). For simplicity, let there be just one word w' in the context of all
the words. Given:



\[
P(w'|w_1) = 0.91 \qquad P(w'|w_2) = 0.80 \qquad
P(w'|w_3) = 0.60 \qquad P(w'|w_4) = 0.50
\]



The distributional distance between word pairs as per the difference PCM: 



\[
\begin{aligned}
\mathit{Dif}(w_1, w_2) &= |0.91 - 0.8| = 0.11 \\
\mathit{Dif}(w_3, w_4) &= |0.6 - 0.5| = 0.1
\end{aligned}
\]



The distributional distance between word pairs as per the division PCM: 



\[
\begin{aligned}
\mathit{Div}(w_1, w_2) &= \log \frac{0.91}{0.8} = 0.056 \\
\mathit{Div}(w_3, w_4) &= \log \frac{0.6}{0.5} = 0.079
\end{aligned}
\]



Observe that for the same set of co-occurrence probabilities, the
difference-based measure ranks the (w3, w4) pair as more distributionally
similar (lower distributional distance), while the division-based measure
ranks the (w1, w2) pair as more similar; that is, it gives lower
distributional distance values for word pairs having large co-occurrence
probabilities. This behavior is not intuitive and it remains to be seen, by
experimentation, which of the three, difference, division or product, yields
distributional similarity measures closest to human notions of semantic
similarity.
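
The arithmetic of this example can be checked with a few lines of code. The
sketch assumes base-10 logarithms, which is the base that reproduces the
values 0.056 and 0.079; the function names are hypothetical.

```python
import math


def dif(p1: float, p2: float) -> float:
    """Difference PCM contribution of a single co-occurring word."""
    return abs(p1 - p2)


def div(p1: float, p2: float) -> float:
    """Division PCM contribution of a single co-occurring word (base-10 log)."""
    return abs(math.log10(p1 / p2))


pair_12 = (0.91, 0.80)   # P(w'|w1), P(w'|w2)
pair_34 = (0.60, 0.50)   # P(w'|w3), P(w'|w4)

for name, (a, b) in (("(w1, w2)", pair_12), ("(w3, w4)", pair_34)):
    print(name, round(dif(a, b), 2), round(div(a, b), 3))
```

The difference PCM assigns the smaller distance to (w3, w4), and the division
PCM assigns the smaller distance to (w1, w2), so the two rank the pairs in
opposite orders.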

The L1 norm is a basic implementation of the difference method. A simple
product-based measure of distributional similarity is proposed below:



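\[
\mathit{Pdt}_{Avg}(w_1, w_2) = \sum_{w \in C(w_1) \cup C(w_2)}
  \frac{P(w|w_1) \times P(w|w_2)}
       {\bigl( \tfrac{1}{2} \bigl( P(w|w_1) + P(w|w_2) \bigr) \bigr)^{2}}
\tag{58}
\]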



The scaling factor used is the square of the average probability. It can be
proved that if the sum of two variables is equal to a constant (k, say),
their values must be equal to k/2 in order to get the largest product. Now,
let k be equal to the sum of P(w|w1)/(P(w|w1) + P(w|w2)) and
P(w|w2)/(P(w|w1) + P(w|w2)). This sum will always be equal to 1 and hence
the product (Z) will be largest only when the two numbers are equal, i.e.,
P(w|w1) is equal to P(w|w2). In other words, the farther P(w|w1) and P(w|w2)
are from their average, the smaller is the product Z. Therefore, the measure
gives high scores for low disparity in strengths of co-occurrence and low
scores otherwise. The incorporation of 2 in the scaling factor results in a
measure that ranges between 0 and 1.
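
A minimal sketch of a product measure scaled by the square of the average
probability is given below. It assumes conditional-probability profiles and
treats a word that co-occurs with only one target as having probability 0 for
the other; the function names and values are hypothetical.

```python
from typing import Dict

Dist = Dict[str, float]  # co-occurring word -> conditional probability


def pdt_term(p: float, q: float) -> float:
    """Product of two probabilities divided by the square of their average."""
    mean = (p + q) / 2.0
    return (p * q) / (mean * mean) if mean > 0.0 else 0.0


def pdt_avg(d1: Dist, d2: Dist) -> float:
    """Sum of the scaled products over all co-occurring words."""
    return sum(pdt_term(d1.get(w, 0.0), d2.get(w, 0.0)) for w in d1.keys() | d2.keys())


print(pdt_term(0.4, 0.4))   # equal strengths
print(pdt_term(0.9, 0.1))   # large disparity
print(pdt_term(0.3, 0.0))   # word seen with only one target
```

Each co-occurring word contributes a value between 0 and 1, with 1 reached
only when the two strengths of co-occurrence are equal.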

The relative entropy-based methods use a weighted division method. Observe
that both Kullback-Leibler divergence (formula repeated here for convenience
as equation (59)) and Jensen-Shannon divergence do not take absolute values
of the division of co-occurrence probabilities. This will mean that if
P(w|w1) > P(w|w2), the logarithm of their ratio will be positive and if
P(w|w1) < P(w|w2), the logarithm will be a negative number. Therefore, there
will be a cancellation of contributions to distributional distance by words
that have a higher co-occurrence probability with respect to w1 and words
that have a higher co-occurrence probability with respect to w2. Observe
however that the weight P(w|w1) multiplied to the logarithm means that in
general the positive logarithm values receive higher weight than the negative
ones, resulting in a net positive score. Therefore, with no absolute value of
the logarithm, as in the KLD, the weight plays a crucial role. A modified
Kullback-Leibler divergence (D^Abs) which incorporates the absolute value is
suggested in equation (60):

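\[
\mathit{KLD}(w_1, w_2) = D(d_1 \,\|\, d_2)
  = \sum_{w \in C(w_1) \cup C(w_2)} P(w|w_1) \log \frac{P(w|w_1)}{P(w|w_2)}
\tag{59}
\]

\[
\mathit{KLD}^{Abs}(w_1, w_2) = D^{Abs}(d_1 \,\|\, d_2)
  = \sum_{w \in C(w_1) \cup C(w_2)} P(w|w_1)
    \left| \log \frac{P(w|w_1)}{P(w|w_2)} \right|
\tag{60}
\]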



The updated Jensen-Shannon divergence measure will remain the same as in
equation (39), except that it is a manipulation of D^Abs and not the original
Kullback-Leibler divergence (relative entropy).

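\[
\mathit{JSD}^{Abs}(w_1, w_2) =
  D^{Abs}\bigl(d_1 \,\|\, \tfrac{1}{2}(d_1 + d_2)\bigr)
  + D^{Abs}\bigl(d_2 \,\|\, \tfrac{1}{2}(d_1 + d_2)\bigr)
\tag{61}
\]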

Note that once the absolute value of the logarithm is taken, it no longer
makes much sense to use an asymmetric weight (P(w|w1)) as in the KLD, nor is
it necessary to use a weight at all. Equation (62) shows a simple
division-based measure. It is an unweighted form of KLD^Abs(w1, w2) and so we
will call it KLD^Abs_Unw:

\[
\mathit{KLD}^{Abs}_{Unw}(w_1, w_2) = \mathit{Div}(w_1, w_2)
  = \sum_{w \in C(w_1) \cup C(w_2)}
    \left| \log \frac{P(w|w_1)}{P(w|w_2)} \right|
\tag{62}
\]



Experimental evaluation of these suggested modifications of Kullback-Leibler
divergence and information radius is part of future work.
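
The effect of the absolute value can be illustrated with a small sketch that
computes the original divergence, the variant with absolute log ratios, and
the unweighted variant on the same pair of distributions. It assumes that
both distributions give non-zero probability to every word considered (for
example, after smoothing); the function names and the distributions are
hypothetical.

```python
import math
from typing import Dict

Dist = Dict[str, float]  # co-occurring word -> non-zero conditional probability


def kld(d1: Dist, d2: Dist) -> float:
    """Signed log ratios weighted by P(w|w1); contributions may cancel."""
    return sum(d1[w] * math.log(d1[w] / d2[w]) for w in d1)


def kld_abs(d1: Dist, d2: Dist) -> float:
    """Absolute log ratios weighted by P(w|w1); no cancellation."""
    return sum(d1[w] * abs(math.log(d1[w] / d2[w])) for w in d1)


def kld_abs_unweighted(d1: Dist, d2: Dist) -> float:
    """Absolute log ratios with no weight: the plain division measure."""
    return sum(abs(math.log(d1[w] / d2[w])) for w in d1)


# Hypothetical distributions over the same two co-occurring verbs.
d1 = {"eat": 0.5, "peel": 0.5}
d2 = {"eat": 0.9, "peel": 0.1}

print(kld(d1, d2), kld_abs(d1, d2), kld_abs_unweighted(d1, d2))
print(kld_abs_unweighted(d2, d1))  # the unweighted form is symmetric
```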

4.2.2. Weighting the PCMs 

The performance of the primary compositional measures may be improved by
adding suitable weights to the distributional distance contributed by each
co-occurrence. The idea is that some co-occurrences may be better indicators
of semantic distance than others. Usually, a formulation of the strength of
association of the co-occurring word with the target words is used as the
weight, the hypothesis being that a strong co-occurrence is likely to be a
strong indicator of semantic distance.

Weighting the primary compositional measures results in some of the existing
measures. For example, as pointed out earlier, the Kullback-Leibler
divergence is a weighted form of the division measure (not considering the
absolute value). Here, the conditional probability of a co-occurring word
with respect to the first word (P(w|w1)) is used as the weight. Since the
weight is dependent on the first word and not the other, we have asymmetry. A
more symmetric weight could be the average of the conditional probabilities
between the co-occurring word and each of the two target words. A
symmetrically weighted division PCM, Saif_AvgWt, is shown below:



\[
\mathit{Saif}_{AvgWt}(w_1, w_2) = \sum_{w \in C(w_1) \cup C(w_2)}
  \tfrac{1}{2} \bigl( P(w|w_1) + P(w|w_2) \bigr)
  \left| \log \frac{P(w|w_1)}{P(w|w_2)} \right|
\tag{63}
\]



We can have corresponding symmetrically weighted Jensen-Shannon divergence
and α skew divergence. L2 norm is a weighted version of the L1 norm, the
weight being P(w|w1) − P(w|w2). A simple product measure with weights is
shown below:

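\[
\begin{aligned}
\mathit{Pdt}_{AvgWt}(w_1, w_2)
  &= \sum_{w \in C(w_1) \cup C(w_2)}
     \tfrac{1}{2} \bigl( P(w|w_1) + P(w|w_2) \bigr) \times
     \frac{P(w|w_1) \times P(w|w_2)}
          {\bigl( \tfrac{1}{2} \bigl( P(w|w_1) + P(w|w_2) \bigr) \bigr)^{2}} \\
  &= \sum_{w \in C(w_1) \cup C(w_2)}
     \frac{P(w|w_1) \times P(w|w_2)}
          {\tfrac{1}{2} \bigl( P(w|w_1) + P(w|w_2) \bigr)}
\end{aligned}
\tag{64}
\]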



A better weight (which is also symmetric) may be chosen given the fol- 
lowing hypothesis: 

The stronger the association of a co-occurring word with a target word,
the better an indicator of semantic properties of the target word it is.

The co-occurring word is likely to have different strengths of association
with the two target words. Taking the maximum of the two as the weight
(Dagan et al. (1995)) will mean that more weight is given to a co-occurring
word if it has high strength of association with either of the two target
words. As Dagan et al. (1995) point out, there is strong evidence for
dissimilarity if the strength of association with the other target word is
much lower than the maximum, and strong evidence of similarity if the
strength of association with both target words is more or less the same.
Equation (65) is a weighted division PCM that captures this intuition.



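$$Saif^{Div}_{MaxWt}(w_1,w_2) = \sum_{w \in C(w_1) \cup C(w_2)} \frac{\max(P(w|w_1),P(w|w_2))}{\sum_{w' \in C(w_1) \cup C(w_2)} \max(P(w'|w_1),P(w'|w_2))} \left|\log\frac{P(w|w_1)}{P(w|w_2)}\right| \qquad (65)$$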



Similarly weighted product and difference measures may be created. Both
$Saif^{Div}_{AvgWt}$ and $Saif^{Div}_{MaxWt}$ give distributional distance
scores from 0 (maximally similar/related) to infinity (completely
dissimilar/unrelated).

It would be interesting to note the effect of weighting on these measures
and also to determine which weight factor is more suitable.
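
As a rough illustration, the two weighting schemes can be sketched in Python. This is a minimal sketch under stated assumptions rather than a reference implementation: the function name `weighted_division`, the dictionary representation of the conditional probabilities, and the floor `eps` (a crude stand-in for proper smoothing, so that every log ratio stays finite) are all hypothetical. The absolute value of the log ratio is taken, as in the division measure, and the toy probabilities are made up.

```python
import math

def weighted_division(p1, p2, weight="avg", eps=1e-9):
    """Weighted division measure over the union of co-occurring words.

    p1[w] and p2[w] hold P(w|w1) and P(w|w2). `eps` is a hypothetical
    floor that replaces zero probabilities (an assumption, not smoothing
    proper) so that every log ratio is finite.
    """
    words = set(p1) | set(p2)
    a = {w: max(p1.get(w, 0.0), eps) for w in words}
    b = {w: max(p2.get(w, 0.0), eps) for w in words}
    if weight == "avg":
        # Average of the two conditional probabilities.
        wt = {w: 0.5 * (a[w] + b[w]) for w in words}
    elif weight == "max":
        # Maximum of the two, normalized over all co-occurring words.
        total = sum(max(a[w], b[w]) for w in words)
        wt = {w: max(a[w], b[w]) / total for w in words}
    else:
        raise ValueError("weight must be 'avg' or 'max'")
    return sum(wt[w] * abs(math.log(a[w] / b[w])) for w in words)

# Toy distributions (made up for illustration).
p_drink = {"water": 0.6, "tea": 0.4}
p_sip = {"water": 0.3, "tea": 0.7}
print(weighted_division(p_drink, p_sip, "avg"))
print(weighted_division(p_drink, p_sip, "max"))
```

With either weight the score is zero for identical distributions and does not depend on the order of the two target words.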

4.3. Measure of Association 

As mentioned earlier, distributional measures use the disparity in association
of the target words with their co-occurring words to determine relatedness.
Lin (1998a) and Hindle (1990) use pointwise mutual information as the
measure of association. The mutual information-based CRMs of Weeds (2003)
also use the same. All other measures studied in this paper use the simple
conditional probability of the co-occurring words, given the target word. It
should be noted that replacing the strength of association in a measure with
another can result in a different distributional measure, for example, the
mutual information-based spatial and fuzzy metrics discussed earlier. Lin's
measure (28) using conditional probability (CP) is shown below:



[Equation (66): Lin's measure (28) with conditional probability (CP) in place of pointwise mutual information.]

Of course, in the case of certain measures, for example the division-based
primary compositional measures, the use of pointwise mutual information and
conditional probability is equivalent.

$$\begin{aligned}
Div^{MI}(w_1,w_2) &= \sum_{w \in C(w_1) \cup C(w_2)} \left| \log \frac{\frac{P(w,w_1)}{P(w)P(w_1)}}{\frac{P(w,w_2)}{P(w)P(w_2)}} \right| && (67) \\
&= \sum_{w \in C(w_1) \cup C(w_2)} \left| \log \frac{P(w|w_1)}{P(w|w_2)} \right| && (68) \\
&= Div(w_1,w_2) && (69)
\end{aligned}$$
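
The equivalence can be checked numerically for a single co-occurring word. The sketch below is illustrative only: the function names, the dictionary layout, and the toy joint and marginal probabilities are assumptions made up for the example.

```python
import math

def log_ratio_pmi(joint, marginal, w, w1, w2):
    """Log of the ratio between the PMI arguments of (w, w1) and (w, w2)."""
    assoc1 = joint[(w, w1)] / (marginal[w] * marginal[w1])
    assoc2 = joint[(w, w2)] / (marginal[w] * marginal[w2])
    return math.log(assoc1 / assoc2)

def log_ratio_cp(joint, marginal, w, w1, w2):
    """Log of the ratio between P(w|w1) and P(w|w2)."""
    cp1 = joint[(w, w1)] / marginal[w1]
    cp2 = joint[(w, w2)] / marginal[w2]
    return math.log(cp1 / cp2)

# Toy probabilities (hypothetical values, for illustration only).
joint = {("tea", "drink"): 0.02, ("tea", "sip"): 0.005}
marginal = {"tea": 0.05, "drink": 0.1, "sip": 0.02}

a = log_ratio_pmi(joint, marginal, "tea", "drink", "sip")
b = log_ratio_cp(joint, marginal, "tea", "drink", "sip")
print(math.isclose(a, b))
```

The marginal probability of the co-occurring word cancels in the ratio, which is why the two strengths of association give the same division measure.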



Weeds (2003) shows that her mutual information-based CRMs exhibit
higher correlation with human judgment on the Miller and Charles word pairs
compared to the ones that use conditional probability. It remains to be seen if
other measures follow the same pattern.

4.4. Predictors of Semantic Relatedness 



Given a pair of target words, the vocabulary may be divided into three sets:
(1) the set of words that co-occur with both target words (common); (2) words
that co-occur with exactly one of the two target words (exclusive); (3) words
that do not co-occur with either of the two target words. Hindle (1990) uses
evidence only from words that co-occur with both target words to determine
the distributional similarity. All the other measures discussed in this paper
so far use words that co-occur with just one target word as well.
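
The three-way division of the vocabulary can be sketched with set operations. This is a minimal illustration; the function name `partition_cooccurrences` and the toy co-occurrence sets are hypothetical.

```python
def partition_cooccurrences(vocabulary, c1, c2):
    """Split the vocabulary by co-occurrence with two target words.

    c1 and c2 are the sets of words that co-occur with each target word.
    Returns (common, exclusive, neither).
    """
    c1, c2 = set(c1), set(c2)
    common = c1 & c2                       # co-occur with both target words
    exclusive = c1 ^ c2                    # co-occur with exactly one
    neither = set(vocabulary) - (c1 | c2)  # co-occur with neither
    return common, exclusive, neither

# Toy co-occurrence sets (made up for illustration).
vocabulary = {"water", "tea", "glass", "slowly", "castle", "move"}
c_drink = {"water", "tea", "glass"}
c_sip = {"water", "tea", "slowly"}
common, exclusive, neither = partition_cooccurrences(vocabulary, c_drink, c_sip)
```

A measure in the style of Hindle (1990) would draw evidence only from `common`, while the other measures would also use `exclusive`.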

One can argue that the more common co-occurrences two words have, the more
they are related. For example, drink and sip may be considered related as they
have a number of common co-occurrences such as water, tea and so on.
Similarly, drink and chess can be deemed unrelated as words that co-occur with
one do not co-occur with the other. For example, water and tea do not usually
co-occur with chess, while castle and move are not found close to drink.
Measures that use all co-occurrences (common and exclusive) tap into this
intuitive notion. However, certain strong exclusive co-occurrences can
adversely affect the measure. Consider the classic strong tea vs. powerful tea
example (Halliday (1966)). The words strong and powerful are semantically very
related. However, the word coffee is likely to co-occur with strong but not
with powerful. Further, strong and coffee can be expected to have a large
value of association as given by a suitable measure, say PMI. This large PMI
value, if used in the distributional relatedness formula, can greatly reduce
the final value. Thus it is not clear if the benefit of using all
co-occurrences is outweighed by the drawback pointed out.

A further advantage of using only common co-occurrences is that the
Kullback-Leibler divergence can now be used without the need for smoothed
probabilities.

$$KLD_{Com}(w_1,w_2) = \sum_{w \in C(w_1) \cap C(w_2)} P(w|w_1) \log\frac{P(w|w_1)}{P(w|w_2)} \qquad (70)$$

Observe that we are taking the intersection of the sets of co-occurring words
instead of the union as in the original formula (37).
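
A minimal sketch of equation (70) follows, assuming that each target word's conditional probabilities are stored in a dictionary holding only the words with non-zero probability. The function name `kld_common` and the toy distributions are hypothetical.

```python
import math

def kld_common(p1, p2):
    """Kullback-Leibler divergence restricted to common co-occurrences.

    p1[w] and p2[w] hold P(w|w1) and P(w|w2); only words with a non-zero
    probability are assumed to be present as keys, so no smoothing is needed.
    """
    common = set(p1) & set(p2)
    return sum(p1[w] * math.log(p1[w] / p2[w]) for w in common)

# Toy distributions (made up for illustration).
p_drink = {"water": 0.5, "tea": 0.3, "coffee": 0.2}
p_sip = {"water": 0.25, "tea": 0.3, "slowly": 0.45}
print(kld_common(p_drink, p_sip))
```

One point to keep in mind with this sketch: once the sum is restricted to the intersection, the probabilities involved no longer need to total one, so the value is not guaranteed to be non-negative (here `kld_common(p_sip, p_drink)` is negative).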

4.5. Capitalizing on Asymmetry 

Given a hypernym-hyponym pair (automobile-car, say), asymmetric distributional
measures such as the Kullback-Leibler divergence, α-skew divergence
and the CRMs generate different values for the distributional similarity of
$w_1$ with $w_2$ as compared to $w_2$ with $w_1$. Usually, if $w_1$ is a more
generic concept than $w_2$, the measures find $w_1$ to be more distributionally
similar to $w_2$ than the other way round. Weeds (2003) argues that this
behavior is intuitive, as it is more often acceptable to substitute a generic
concept in place of a specific one than vice versa, and substitutability is an
indicator of semantic similarity. On the other hand, in most cases the notion
of asymmetric semantic similarity is counter-intuitive, and possibly
detrimental. Further, in case two words share a hypernym-hyponym relation,
they are likely to be highly semantically similar. Thus, given two words, it
may make sense to always choose the higher of the two distributional
similarity values suggested by an asymmetric measure as the final
distributional similarity between the two. This way an asymmetric measure
($Sim_{Asym}$) can easily be converted into a symmetric one ($Sim_{Max}$),
while still capitalizing on the asymmetry to generate more suitable
distributional similarity values for hypernym-hyponym word pairs. Equation (71)
states the formula for the proposed conversion. A specific implementation on
the KL divergence formula is given in equation (72).



$$Sim_{Max}(w_1,w_2) = \max(Sim_{Asym}(w_1,w_2), Sim_{Asym}(w_2,w_1)) \qquad (71)$$

$$KLD_{Max}(w_1,w_2) = \max(KLD(w_1,w_2), KLD(w_2,w_1)) \qquad (72)$$

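Another method to convert an asymmetric measure of distributional
similarity (or relatedness) into a symmetric one is by taking the average
(formula 73) of the two possible similarity values. A specific implementation
on the KL divergence formula is given in equation (74).

$$Sim_{Avg}(w_1,w_2) = \frac{1}{2}\bigl(Sim_{Asym}(w_1,w_2) + Sim_{Asym}(w_2,w_1)\bigr) \qquad (73)$$

$$\begin{aligned}
KLD_{Avg}(w_1,w_2) &= \frac{1}{2}\bigl(KLD(w_1,w_2) + KLD(w_2,w_1)\bigr) && (74) \\
&= \frac{1}{2} \sum_{w \in C(w_1) \cup C(w_2)} \left( P(w|w_1)\log\frac{P(w|w_1)}{P(w|w_2)} + P(w|w_2)\log\frac{P(w|w_2)}{P(w|w_1)} \right) && (75) \\
&= \frac{1}{2} \sum_{w \in C(w_1) \cup C(w_2)} \left( P(w|w_1)\log\frac{P(w|w_1)}{P(w|w_2)} - P(w|w_2)\log\frac{P(w|w_1)}{P(w|w_2)} \right) && (76) \\
&= \frac{1}{2} \sum_{w \in C(w_1) \cup C(w_2)} \bigl(P(w|w_1) - P(w|w_2)\bigr)\log\frac{P(w|w_1)}{P(w|w_2)} && (77)
\end{aligned}$$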



Determining the effectiveness of such conversions of existing asymmetric 
measures is part of our future work. 
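
Both conversions can be sketched as small wrappers around any asymmetric measure. This is an illustrative sketch only: the helper names `symmetrize_max` and `symmetrize_avg` are hypothetical, the KL divergence is assumed to be computed over smoothed distributions that share the same words, and the toy probabilities are made up.

```python
import math

def kld(p, q):
    """Kullback-Leibler divergence; p and q are assumed to be smoothed
    distributions over the same words, with no zero probabilities."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def symmetrize_max(measure):
    """Symmetric measure that keeps the larger of the two directed values."""
    return lambda x, y: max(measure(x, y), measure(y, x))

def symmetrize_avg(measure):
    """Symmetric measure that averages the two directed values."""
    return lambda x, y: 0.5 * (measure(x, y) + measure(y, x))

kld_max = symmetrize_max(kld)
kld_avg = symmetrize_avg(kld)

# Toy smoothed distributions (hypothetical values).
p_car = {"drive": 0.5, "road": 0.3, "engine": 0.2}
p_automobile = {"drive": 0.4, "road": 0.2, "engine": 0.4}

# Closed form of the averaged KL divergence, as in equation (77).
closed_form = 0.5 * sum(
    (p_car[w] - p_automobile[w]) * math.log(p_car[w] / p_automobile[w])
    for w in p_car
)
print(kld_avg(p_car, p_automobile), closed_form)
```

The two printed values agree up to floating-point error, which is a quick check on the simplification from equation (75) to equation (77).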



4.6. How CRMs Fit 



The CRMs suggested by Weeds (2003) are the first distributional measures to
be evaluated by comparing ranked word pairs with those ranked by humans
(Miller and Charles word pairs). At first glance the CRMs may look quite
distinct from the rest of the distributional measures studied so far, owing to
their rather complex formulae and multiple optimizing parameters. However,
setting the parameters to certain standard values equates a few of the CRMs
to other measures. The difference-weighted token-based CRM suggested by
Weeds has identical values for precision and recall. She proves that the
precision (or recall) is inversely related to the $L_1$ norm measure. This
seemingly odd result of equating a distributional distance measure with a
precision (or recall) value makes sense for the following reason: as
substitutability is defined as a measure of distributional similarity, metrics
such as precision and recall, which quantify how good the substitution is,
reflect the distributional similarity and are inversely related to
distributional distance. Thus setting $\gamma = 0$ and $\beta = 1$ or $0$
causes the CRM to behave like the $L_1$ norm. Further, as shown below, setting
$\gamma = 1$ (in other words, taking the F measure) makes the
difference-weighted mutual information-based CRM identical to the mutual
information-based Dice coefficient (34). A proof follows (see note 4). The
precision and recall of the difference-weighted MI-based CRMs are repeated
here (equations (78) and (79)) for convenience.



$$P^{MI}_{dw}(w_1,w_2) = \frac{\sum_{w \in C(w_1) \cap C(w_2)} \min(I(w,w_1), I(w,w_2))}{\sum_{w \in C(w_1)} I(w,w_1)} \qquad (78)$$

$$R^{MI}_{dw}(w_1,w_2) = \frac{\sum_{w \in C(w_1) \cap C(w_2)} \min(I(w,w_1), I(w,w_2))}{\sum_{w \in C(w_2)} I(w,w_2)} \qquad (79)$$

Theorem 2. The difference-weighted mutual information-based CRM equates
to the mutual information-based Dice coefficient if its parameter $\gamma$ is set to 1.



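Proof.

$$\begin{aligned}
Sim^{MI}_{dw}(w_1,w_2) &= \gamma\left[\frac{2 \times P \times R}{P + R}\right] + (1-\gamma)\bigl[\beta[P] + (1-\beta)[R]\bigr] \\
&= 1\left[\frac{2 \times P \times R}{P + R}\right] + (1-1)\bigl[\beta[P] + (1-\beta)[R]\bigr] \\
&= \frac{2 \times P \times R}{P + R}
\end{aligned}$$

On substituting values for P and R from equations (78) and (79):

$$\begin{aligned}
Sim^{MI}_{dw}(w_1,w_2) &= \frac{2 \left(\frac{\sum_{w \in C(w_1) \cap C(w_2)} \min(I(w,w_1), I(w,w_2))}{\sum_{w \in C(w_1)} I(w,w_1)}\right) \left(\frac{\sum_{w \in C(w_1) \cap C(w_2)} \min(I(w,w_1), I(w,w_2))}{\sum_{w \in C(w_2)} I(w,w_2)}\right)}{\left(\frac{\sum_{w \in C(w_1) \cap C(w_2)} \min(I(w,w_1), I(w,w_2))}{\sum_{w \in C(w_1)} I(w,w_1)}\right) + \left(\frac{\sum_{w \in C(w_1) \cap C(w_2)} \min(I(w,w_1), I(w,w_2))}{\sum_{w \in C(w_2)} I(w,w_2)}\right)} \\
&= \frac{\dfrac{2 \left(\sum_{w \in C(w_1) \cap C(w_2)} \min(I(w,w_1), I(w,w_2))\right)^2}{\left(\sum_{w \in C(w_1)} I(w,w_1)\right)\left(\sum_{w \in C(w_2)} I(w,w_2)\right)}}{\dfrac{\left(\sum_{w \in C(w_1) \cap C(w_2)} \min(I(w,w_1), I(w,w_2))\right)\left(\sum_{w \in C(w_1)} I(w,w_1) + \sum_{w \in C(w_2)} I(w,w_2)\right)}{\left(\sum_{w \in C(w_1)} I(w,w_1)\right)\left(\sum_{w \in C(w_2)} I(w,w_2)\right)}} \\
&= \frac{2 \times \sum_{w \in C(w_1) \cap C(w_2)} \min(I(w,w_1), I(w,w_2))}{\sum_{w \in C(w_1)} I(w,w_1) + \sum_{w \in C(w_2)} I(w,w_2)} \\
&= Dice^{MI}(w_1,w_2)
\end{aligned}$$

$\Box$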
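
The result can also be checked numerically. The sketch below is illustrative and rests on assumptions: the two CRM parameters are read as `beta` and `gamma`, the association values are stored in dictionaries and assumed positive, and all function names and toy values are hypothetical.

```python
import math

def crm_similarity(precision, recall, beta, gamma):
    """CRM combination: gamma weights the F measure against the
    beta-weighted sum of precision and recall."""
    f_measure = 2 * precision * recall / (precision + recall)
    return gamma * f_measure + (1 - gamma) * (beta * precision + (1 - beta) * recall)

def dw_mi_precision_recall(i1, i2):
    """Difference-weighted MI-based precision and recall.

    i1[w] and i2[w] hold the association I(w, w1) and I(w, w2) for the
    co-occurring words of each target word (positive values assumed).
    """
    shared = sum(min(i1[w], i2[w]) for w in set(i1) & set(i2))
    return shared / sum(i1.values()), shared / sum(i2.values())

def dice_mi(i1, i2):
    """Mutual information-based Dice coefficient."""
    shared = sum(min(i1[w], i2[w]) for w in set(i1) & set(i2))
    return 2 * shared / (sum(i1.values()) + sum(i2.values()))

# Toy association values (made up for illustration).
i_w1 = {"water": 2.0, "tea": 1.5, "glass": 0.5}
i_w2 = {"water": 1.0, "tea": 2.5, "slowly": 1.5}

p, r = dw_mi_precision_recall(i_w1, i_w2)
print(crm_similarity(p, r, beta=0.5, gamma=1.0), dice_mi(i_w1, i_w2))
```

With `gamma=1.0` the value of `beta` has no effect, and the two printed values coincide up to floating-point error; with `gamma=0.0` and `beta` set to 1 or 0 the function returns the precision or the recall alone.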



4.7. Hit and Miss Co-occurrences 

Lastly, we examine two kinds of co-occurrences that pose a challenge to
existing distributional measures: (1) Word pairs that occur together fewer
times than would be expected by chance. Measures like PMI cannot
predict their association values with confidence and, as pointed out earlier,
this is countered by ignoring them completely. This means that the system
misses out on evidence from this set of co-occurrence pairs. (2) Co-occurrence
pairs formed by a word with target words that are near synonyms. Inkpen and
Hirst (2002) point out that near synonyms (for example, hidden and
concealed) may form strong collocations and anti-collocations, respectively,
with the same co-occurring word (for example, agenda). All distributional
measures that use strength of association to determine semantic relatedness
will consider the large discrepancy in strength of association as evidence of
unrelatedness. Therefore, these co-occurrence pairs, which are not ignored
(unlike the previous ones), will negatively impact the ability of
distributional measures to predict the semantic relatedness of near synonyms.
It should be noted that we cannot eliminate such co-occurrences in a
straightforward manner, simply because we are not aware a priori if the target
words are near synonyms. It would be interesting to determine the precise
quantitative effect of such co-occurrences on the performance of
distributional measures.

4.8. Summarizing the Distributional Measures 

In the last two sections we have seen numerous distributional measures. Ta- 
bles I, II, III, and IV listed in the appendix summarize their properties. 



5. Semantic Network and Ontology-Based Measures 

Creation of electronically available ontologies and semantic networks like 
WordNet has allowed their use to help solve numerous natural language prob- 
lems including the measurement of semantic distance between two words. 
Budanitsky (1999), Budanitsky and Hirst (2001) and Patwardhan et al. (2003) 
have done an extensive survey of the various WordNet-based measures, their 
comparisons with human judgment on selected word pairs, and their efficacy 
in applications such as spelling correction and word sense disambiguation. 
Hence, this paper provides just a brief summary of the major WordNet-based 
measures of similarity and focuses on their comparison with distributional 
ones. 

One of the earliest and simplest measures is the Rada et al. (1989) edge
counting method. The shortest path in the network between the two target
words (target path) is determined. The more edges there are between two
words, the more distant they are. Elegant as it may be, the measure relies
on the unlikely assumption that all the network edges correspond to identical
semantic distance between the nodes they connect. Nodes in a network may
be connected by numerous relations such as hyponymy, meronymy and so
on. Edge counts apart, Hirst and St-Onge (1998) take into account the fact
that if the target path consists of edges that belong to a number of such
relations, the target words are likely more distant. The idea is that if we
start from a particular node (base word) and take a path via a particular
relation (say, hyponymy), to a certain extent the words reached will be quite
related to the base word. However, if along the way we take edges belonging to
different relations (other than hyponymy), very soon we may reach words
that are unrelated. Hirst and St-Onge's measure of semantic relatedness is
listed below:

$$HS(c_1,c_2) = C - \text{path length} - k \times d \qquad (80)$$

where $c_1$ and $c_2$ are the target concepts/words, and $d$ is the number of
times an edge corresponding to a different relation than that of the preceding
edge is taken. $C$ and $k$ are empirically determined constants.

Leacock and Chodorow (1998) used just one relation (hyponymy) and
modified the path length formula to reflect the fact that edges lower down in
the is-a hierarchy correspond to smaller semantic distance than the ones higher
up. For example, sports car and car (low in the hierarchy) are much more
similar than transport and instrumentation (higher up in the hierarchy) even
though both pairs of words are separated by exactly one edge in the is-a
hierarchy of WordNet.

$$LC(c_1,c_2) = -\log\frac{len(c_1,c_2)}{2D} \qquad (81)$$

where $D$ is the depth of the taxonomy.
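
The two path-based scores can be sketched as plain functions of quantities read off the network. This is a minimal sketch, not either paper's implementation: it assumes the usual normalization of the Leacock and Chodorow score by twice the taxonomy depth, takes the path length and the number of relation changes as already computed, and uses hypothetical function names. The constants in the usage lines are placeholders, not values reported in the source.

```python
import math

def hirst_st_onge(path_length, relation_changes, C, k):
    """Equation (80): relatedness falls with the path length and with the
    number of times the path switches to a different relation."""
    return C - path_length - k * relation_changes

def leacock_chodorow(path_length, depth):
    """Scaled path length: -log(len / (2 * D)), with len >= 1."""
    if path_length < 1 or depth < 1:
        raise ValueError("path_length and depth must be at least 1")
    return -math.log(path_length / (2.0 * depth))

# Placeholder constants and depth (hypothetical, for illustration only).
print(hirst_st_onge(path_length=3, relation_changes=1, C=8, k=1))
print(leacock_chodorow(path_length=2, depth=16))
```

In both functions a shorter path gives a higher score, so they behave as measures of relatedness or similarity rather than distance.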

Resnik (1995) suggested a measure that used corpus statistics along with
the knowledge obtained from a semantic network. The measure is based on
the notion that the semantic similarity of two words may be determined from
the word that represents their similarity (the lowest common subsumer or
lowest super-ordinate (lso)). The more information contained in this
node, the more similar the two words are. Observe that using information
content (IC) has the effect of inherently scaling the semantic similarity
measure by depth of the taxonomy. Usually, the lower the lowest
super-ordinate, the lower is the probability of occurrence of the lso and the
concepts subsumed by it, and hence, the higher is its information content.

$$Res(c_1,c_2) = -\log p(lso(c_1,c_2)) \qquad (82)$$

As per the formula, given a particular lowest super-ordinate, the exact
positions of the target words below it in the hierarchy do not have any effect
on the semantic similarity. Intuitively, we would expect that word pairs closer
to the lso are more similar than those that are distant. Jiang and Conrath
(1997) and Lin (1997) incorporate this notion into their measures, which are
arithmetic variations of the same terms. The Jiang and Conrath (1997)
measure (denoted by JC) determines how dissimilar each target concept is from
the lso ($IC(c_1) - IC(lso)$ and $IC(c_2) - IC(lso)$). The final semantic
distance between the two concepts is then taken to be the sum of these
differences (see Budanitsky (1999) for more details). Lin (1997) points out
that the lso is what is common between the two target concepts and that its
information content is the common information between the two concepts. Lin's
formula (denoted by Lin) can thus be thought of as taking the Dice coefficient
of the information in the two concepts.



$$JC(c_1,c_2) = 2\log p(lso(c_1,c_2)) - (\log p(c_1) + \log p(c_2)) \qquad (83)$$

$$Lin(c_1,c_2) = \frac{2 \times \log p(lso(c_1,c_2))}{\log p(c_1) + \log p(c_2)} \qquad (84)$$
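
The three information content-based formulas can be sketched as functions of the concept probabilities. This is an illustrative sketch: the function names are hypothetical, the probabilities of the two concepts and of their lso are assumed to have been estimated from a corpus beforehand, and the toy values are made up.

```python
import math

def information_content(p):
    """Information content of a concept with probability p."""
    return -math.log(p)

def resnik(p_lso):
    """Equation (82): information content of the lowest super-ordinate."""
    return information_content(p_lso)

def jiang_conrath(p_c1, p_c2, p_lso):
    """Equation (83): a distance, the sum of IC(c) - IC(lso) over both concepts."""
    return 2 * math.log(p_lso) - (math.log(p_c1) + math.log(p_c2))

def lin(p_c1, p_c2, p_lso):
    """Equation (84): shared information relative to the total information."""
    return 2 * math.log(p_lso) / (math.log(p_c1) + math.log(p_c2))

# Toy probabilities (hypothetical); the lso is at least as probable as each concept.
p_c1, p_c2, p_lso = 0.001, 0.002, 0.01
print(resnik(p_lso), jiang_conrath(p_c1, p_c2, p_lso), lin(p_c1, p_c2, p_lso))
```

When both concepts coincide with their lso, the Jiang-Conrath distance is zero and Lin's similarity is one, which matches the reading of the former as a distance and the latter as a similarity.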

Budanitsky and Hirst (2001) show that the Jiang-Conrath measure has the
highest correlation (0.850) with the Miller and Charles word pairs and
performs better than all other measures considered in a spelling correction
task. Patwardhan et al. (2003) get similar results using the measure for word
sense disambiguation (especially of nouns).



6. Comparison of Distributional and Ontology-Based Measures 

Distributional and ontology-based measures use distinct sources of knowl- 
edge to achieve the same goal — the ability to mimic human judgment of 
semantic relatedness. Owing to the difference in methodology, many inter- 
esting comparisons may be made. The next few subsections aim at bringing 
them to light. 

6.1. Knowledge Source versus Similarity Measure 

Ontologies are much more expensive resources than raw data, which is freely
available. Creating an ontology requires human experts and is time intensive,
and the result is rather brittle to changes in language. Once created, updating
an ontology is again expensive, and there is usually a lag between the current
state of language usage/comprehension and the semantic network representing it.
Further, the complexity of human languages makes the creation of even a
near-perfect semantic network of their concepts impossible. Thus in many ways
the ontology-based measures are only as good as the networks on which they are
based. On the other hand, large corpora, trillions of words in size, may now be
collected by a simple web crawler. Large corpora of more formal writing are
also available (for example, the Wall Street Journal or the American Printing
House for the Blind (APHB) corpus). Therefore, in the case of distributional
measures, it is the choice of a measure that best captures the semantic
similarity-predicting information, rather than the knowledge source, that
plays the more vital role. As ontologies are a rich source of information in
which the various concepts are linked together by powerful relations such as
hyponymy and meronymy, the ontological measures are likely to correctly
identify target words related by edges that belong to just one relation as
very similar. Data sparseness, however, may force distributional measures to
assign low similarity values to clearly related word pairs. Assigning
appropriate semantic similarity values when target words are connected by
different relational edges poses a major challenge to ontological measures.

6.2. Domain-Specific Semantic Similarity 

So far, this paper has talked about universal similarity measures. Given a
word pair, the measures each give just one similarity value. However, two
words may be very semantically similar in a certain domain but not so much
in another. For example, the words space and time are closely related in
the domain of quantum mechanics but not so much in most others. Ontologies
have been made for specific domains, which may be used to determine
semantic similarity specific to these domains. However, the number of such
ontologies is very limited. On the other hand, large corpora specific to
particular domains are much easier to collect, allowing a widespread
use of distributional domain-specific similarity.

6.3. Associated Words 

Certain word pairs have a special relation with each other, for example,
strawberry and cream, doctor and scalpel, and so on. These words are not
similar physically or in properties, but strawberries are usually eaten with
cream and a doctor uses a scalpel to make an incision. An ontology-based
measure will correctly identify the amount of semantic relatedness only if
such relations are inherent in the ontology. For example, if the
agent-instrument relation does not link concepts in a semantic network (as in
WordNet), the ontology-based measures will not identify doctor and scalpel as
related.

Of the various distributional measures discussed, the ones that use simple
co-occurrences capture such semantic relatedness, as words that tend to occur
together are likely to have a large set of common co-occurring words. Measures
(e.g., Lin (1998a), Hindle (1990)) that consider a word w to be a shared
co-occurrence only if w is related to both target words by the same syntactic
relation will not find such words related, simply because words that
tend to occur in the same sentence are likely to have different thematic roles
and thus different syntactic relations with common co-occurring words.

6.4. Multi-faceted Concepts 

The various senses of a word represent distinct concepts. Each of these
concepts can usually be described by a number of attributes or features. These
attributes may be physical descriptions like color, shape and composition or
function, purpose and role. Two words are adjudged similar if they share a
number of such attributes and if the strength of the shared attributes is high.
By strength we mean how strongly an attribute helps define the words. The
more prominent a shared feature, the more similar the two words are. Further,
it is possible that words $w_1$ and $w_2$ are related as they share a certain
set of attributes, while $w_2$ and $w_3$ are related because they share a
different set of attributes. Thus $w_1$ and $w_3$ are likely not as related as
$w_1$ and $w_2$, or $w_2$ and $w_3$. For example, the physical key is closely
related to the abstract password, as they are both means of getting access.
Password is closely related to encryption as they both pertain to data
security. However, the physical key has little to do with encryption and the
two are not so much related. Thus semantic relatedness is not necessarily
transitive and may be a function of a subset of relevant attributes, not
necessarily all.

Figure 1. Hierarchy Variations.

Hierarchies in an ontology are built by repetitive division of concepts as
per their attributes. The order in which these attributes are used to create
the tree structure can result in dramatically different hierarchies. For
example, consider the scenario depicted in figure 1, where the attributes
$a_1$ and $a_2$ are used in different orders to create different hierarchies
of the words $w_1$, $w_2$, $w_3$ and $w_4$. Notice that while $w_1$ and $w_2$
are closer to each other than $w_1$ and $w_3$ in hierarchy-1, it is the other
way round in hierarchy-2. Thus variations in the order of use of attributes
for creating the hierarchy can result in different sets of words being close
to each other.
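
The effect of attribute order can be sketched with a toy construction. Everything here is an assumption made for illustration: the binary attribute values assigned to the four words, the function names, and the use of the number of edges between leaves as the distance. It is not a description of how any real ontology is built.

```python
def hierarchy_paths(attributes, order):
    """Path from the root to each word when attributes are applied in `order`."""
    return {word: tuple(values[a] for a in order)
            for word, values in attributes.items()}

def tree_distance(paths, x, y):
    """Number of edges between two leaves of the hierarchy."""
    px, py = paths[x], paths[y]
    shared = 0
    for vx, vy in zip(px, py):
        if vx != vy:
            break
        shared += 1
    return (len(px) - shared) + (len(py) - shared)

# Hypothetical attribute values for four words.
attributes = {
    "w1": {"a1": 0, "a2": 0},
    "w2": {"a1": 0, "a2": 1},
    "w3": {"a1": 1, "a2": 0},
    "w4": {"a1": 1, "a2": 1},
}
hierarchy_1 = hierarchy_paths(attributes, ["a1", "a2"])  # split on a1 first
hierarchy_2 = hierarchy_paths(attributes, ["a2", "a1"])  # split on a2 first

print(tree_distance(hierarchy_1, "w1", "w2"), tree_distance(hierarchy_1, "w1", "w3"))
print(tree_distance(hierarchy_2, "w1", "w2"), tree_distance(hierarchy_2, "w1", "w3"))
```

Under these assumed attribute values, the pair that sits under a common parent in one hierarchy is separated at the root in the other, so any path-based measure would rank the pairs differently depending on the order chosen.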

It should be noted that real-world semantic hierarchies are created by well 
formed methodologies and hence the order of attributes used to create the 
hierarchy is not arbitrary. That said, there is room for variation and further, 
once a particular hierarchy is chosen, it captures certain semantic relations in 
its structure, while others are lost. 

In general, ontology-based measures of similarity capitalize on the
property that words that occur close to each other in a hierarchy share a lot
of attributes and are therefore similar. However, they usually rely on a fixed
hierarchy. Word pairs that would be closer in variations of the hierarchy are
not considered. Thus ontology-based measures are likely to wrongly assign a
low semantic similarity value to such word pairs. For example, consider key
and password. They are both means to gain access but the is-a hierarchy in
WordNet lists them in completely different branches of the network (figure 2).
The attribute determining whether the word refers to a physical entity or an
abstraction is used first to classify the words and hence key and password fall
into different branches at the top of the hierarchy itself. Thus an
ontology-based measure is likely to find them unrelated. Distributional
measures are not bound by a fixed hierarchy and have a better chance at
appropriately identifying the semantic similarity of such word pairs. It would
be worth determining the extent to which this is true.




[Figure 2 node labels, top to bottom: Object, physical object; Whole, whole thing, unit; Artifact, artefact; Instrumentality, instrumentation; Key]

Figure 2. key and password in the 'is-a' hierarchy of WordNet



6.5. Evaluation and Complementarity 

Ontology-based and distributional measures of similarity have each been in- 
dividually shown to be reasonable quantifiers of semantic similarity. WordNet- 
based measures have been used for applications such as spelling correction 
and word sense disambiguation, while distributional measures have primarily 
been used for estimating probabilities of unseen bigrams. Exhaustive compar- 
isons of WordNet-based measures with each other (e.g., Budanitsky (1999), 
Budanitsky and Hirst (2001) and Patwardhan et al. (2003)) have found that 
the Jiang-Conrath measure performs better than the rest. 

Dagan et al. (1994) perform experiments with a few relative entropy-based
measures and find that Jensen-Shannon divergence is slightly better
than Kullback-Leibler divergence and the $L_1$ norm in estimating bigram
probabilities of unseen words and in a pseudo-word sense disambiguation
experiment. However, the various distributional measures have not been used
to rank the Miller-Charles or Rubenstein-Goodenough word pairs, for which
estimates of human judgment of semantic relatedness are available.
Experiments to this end will also enable a comparison of the distributional
measures with the ontology-based measures for which this data is already
available. Similar to the case of various ontological measures, it is worth
determining which distributional measure is closest to the human notion of
semantic similarity and how well the distributional measures, which rely just
on raw data, fare against the more expensive and knowledge-rich ontology-based
measures.

Since the two kinds of measures rely on different knowledge sources,
there is a likelihood that distributional measures more accurately identify the
semantic similarity of a certain subset of word pairs, while the ontological
methods do so for a different subset. One of the more important problems
in the field is to quantify this complementarity. It should be noted that since
a similarity measure is evaluated by comparison of ranked word pairs and
not by the similarity values alone, capitalizing on the complementarity by
creating a combined semantic similarity predictor is a much harder problem.
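
To make the idea of comparing rankings concrete, the sketch below computes a rank correlation between two scorings of the same word pairs. The source does not say which statistic is meant here; Spearman's rank correlation is swapped in as one common choice. The function names and the toy scores are hypothetical, and ties are assumed to be absent.

```python
def ranks(scores):
    """Rank positions (1 = highest score); ties are assumed absent."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {item: position for position, item in enumerate(order, start=1)}

def spearman(scores_a, scores_b):
    """Spearman rank correlation between two scorings of the same word pairs."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((ra[item] - rb[item]) ** 2 for item in ra)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Toy scores for four unnamed word pairs (made up for illustration).
human = {"pair_a": 3.9, "pair_b": 3.1, "pair_c": 1.2, "pair_d": 0.4}
measure = {"pair_a": 0.8, "pair_b": 0.9, "pair_c": 0.3, "pair_d": 0.1}
print(spearman(human, measure))
```

Only the order of the scores matters in this comparison, which is why the raw values of two different measures cannot simply be averaged into a combined predictor. A distance measure would need its scores negated (or its ranking reversed) before being compared with similarity judgments in this way.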



7. Conclusions 

The paper has provided a detailed analysis of various corpus-based
distributional measures and compared them with measures based on ontologies
and semantic networks. Merits and limitations of the various measures were
listed. New measures that are likely to overcome the drawbacks of present
distributional measures were suggested. Specifically, a distributional
measure that keeps the best of Hindle (1990) and Lin (1998a), overcoming their
respective drawbacks, was proposed. Variations of Kullback-Leibler
divergence and Jensen-Shannon divergence that better capture the disparity in
co-occurrence probabilities were suggested. A simple technique to convert
asymmetric measures into symmetric ones was suggested. Novel approaches
were described to determine distributional similarity by better utilization of
co-occurring words related by different syntactic relations.

The paper identified significant research problems that need to be
answered through experimentation. This will help better understand how
statistics from raw data may be manipulated to determine appropriate
similarity values between two words. For example, does the use of
syntactically related co-occurrences, rather than plain co-occurrences,
significantly improve the measure? Or are simple co-occurrences just as
useful? What kinds of co-occurrences (common only, or exclusive as well)
should be used to determine distributional relatedness? Is pointwise mutual
information or conditional probability the more suitable measure of
association to be used in the various distributional measures? Do
compositional or non-compositional distributional measures produce more
intuitive semantic similarity values? In the case of compositional measures,
which mathematical operation (difference, division, or product) on the
co-occurrence distributions yields values that are closest to human judgment?
A direct evaluation of the distributional measures (other than the $L_1$ norm,
α-skew divergence and the CRMs, for which these results exist) by their
correlation with the Miller-Charles and Rubenstein-Goodenough word pairs will
provide better insight into their relative abilities and will enable a
comparison with WordNet-based measures, for which the correlation coefficients
are already available.

Lastly, the paper pointed out that even though ontological measures are
likely to perform better as they rely on a much richer knowledge source,
distributional measures have certain distinct advantages. For example, they
can easily provide domain-specific similarity values for a large number of
domains, they can determine the similarity of contextually associated word
pairs more appropriately, and they have the flexibility to identify
multi-faceted concepts as related from appropriate commonalities that may not
be explicitly encoded in a semantic network. Thus it is very likely that
ontological measures are better at predicting semantic similarity for certain
word pairs, while the distributional measures do so for others. Identifying
the extent of this complementarity, and a suitable combined methodology to
assign semantic similarity, remain significant problems in the field. A
significant challenge in achieving this is how to reconcile the nature of the
two kinds of measures: while ontological measures predict the semantic
similarity of two concepts (or word senses), distributional measures do so for
two words. One of the problems we intend to pursue is the development of a
methodology that enables the use of distributional measures to predict
semantic similarity of concepts, with little or no sense-tagged data.

Appendix 

.1. Co-occurrence Retrieval Models 

The precision and recall of additive and difference-weighted CRMs (Weeds,
2003).

where C(x) is the set of all co-occurrences of word x. Note that in the case of
the difference-weighted token and mutual information-based precision and
recall formulae, there is a cancellation of a pair of terms obtained from the
core formulae and the penalty.

.2. Summarizing Tables 

Tables I and II list the measures of distributional distance, while Tables III
and IV list the measures of distributional relatedness/similarity. If a
measure is placed in a distributional distance table, it means that the
intuition behind the measure led to its original conception as a distance
measure, and similarly for a relatedness measure. It should be noted, however,
that a distance measure may be converted into a relatedness measure by taking
the inverse or by some other such mathematical manipulation, and vice versa.
Apart from the formula, the tables show whether the measure is compositional
(Comp.) or not, and if so, the kind of primary compositional measure (PCM) it
is derived from. The last column (Str.) indicates the particular strength of
association used (most commonly) in the measure: conditional probability (CP)
or pointwise mutual information (PMI).

Table II. Measures of distributional distance (continued).

Table III. Measures of distributional relatedness/similarity.

Note: 'n.a.' stands for 'not applicable'. For example, the cos measure is not a
compositional measure and therefore the type of PCM is not applicable.

Table IV. Measures of distributional relatedness/similarity (continued).

Note: The MI-based CRMs use pointwise mutual information, while the type- and
token-based CRMs use conditional probability as the strength of association.



Distributional Measures as Proxies for Semantic Relatedness 45 

Acknowledgements 

We would like to thank Dr. Suzanne Stevenson, Dr. Gerald Penn, and Dr. Ted 
Pedersen for their valuable feedback and thought-provoking discussions. This 
research is financially supported by the Natural Sciences and Engineering 
Research Council of Canada and the University of Toronto. 



Notes 

1. In their respective papers, Robert Fano as well as Ken Church and Patrick
Hanks refer to pointwise mutual information as mutual information.

2. It is hard to accurately predict negative word association ratios with
confidence (Church and Hanks (1989)).

3. In their respective papers, Donald Hindle and Dekang Lin refer to pointwise
mutual information as mutual information.

4. P is short for $P(w_1,w_2)$, while R is short for $R(w_1,w_2)$. The
abbreviations are made due to space constraints and to improve readability.